Databricks Community

Phani1 · 07-29-2024

Hi Team,

We want to create a Delta table with a historical load of 10 TB of data, and we expect an incremental refresh of about 15 GB each day.

What factors should we take into account when managing such a large volume of data, especially in terms of cost and performance?

I have the following points in mind; please let me know if there's anything else we should consider.

1) We should have a good partitioning strategy in place, and we can also consider liquid clustering if we are unsure about the size of each partition.
2) It's important to run regular housekeeping tasks such as vacuuming and Z-ordering every week. Depending on how well the system performs, we can increase the frequency of these tasks to every other day.
3) Move any historical data older than 7 years into a separate table (such as an archive table) and run transactions against the active table only. This decision should be based on business needs and on whether all the data is required for calculations.
4) When setting up the Delta table, place the most frequently used columns within the first 32 columns, since by default Delta collects data-skipping statistics only for the first 32 columns.
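
The four points above can be sketched as a handful of Delta SQL statements. This is only a rough illustration, not a recommended configuration: the table names, schema, clustering columns, and cutoff date are all hypothetical, and the statements are built as plain strings so they can be reviewed first and then passed to `spark.sql(...)` on a Databricks cluster. Note that liquid clustering (`CLUSTER BY`) is an alternative to partitioning and Z-ordering rather than an addition to them, so a table would use one approach or the other.

```python
from datetime import date

# Hypothetical names for illustration only; replace with your own.
TABLE = "sales.transactions"
ARCHIVE_TABLE = "sales.transactions_archive"
SCHEMA = "txn_date DATE, customer_id BIGINT, amount DECIMAL(18, 2)"


def create_clustered_table_sql(table, schema, cluster_cols):
    """Point 1: liquid clustering instead of a fixed partition layout."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({schema}) "
        f"USING DELTA CLUSTER BY ({', '.join(cluster_cols)})"
    )


def maintenance_sql(table, retain_hours=168, zorder_cols=None):
    """Point 2: statements for a periodic housekeeping job.

    OPTIMIZE compacts small files. ZORDER BY is added only when columns
    are given, i.e. for tables that do NOT use liquid clustering.
    VACUUM removes unreferenced files older than the retention window
    (168 hours = 7 days, the Delta default).
    """
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        optimize += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return [optimize, f"VACUUM {table} RETAIN {retain_hours} HOURS"]


def archive_sql(table, archive_table, date_col, cutoff):
    """Point 3: copy rows older than `cutoff` to the archive, then delete them.

    The cutoff is a fixed date literal so both statements use the same boundary.
    """
    predicate = f"{date_col} < DATE '{cutoff.isoformat()}'"
    return [
        f"INSERT INTO {archive_table} SELECT * FROM {table} WHERE {predicate}",
        f"DELETE FROM {table} WHERE {predicate}",
    ]


def stats_columns_sql(table, num_cols=32):
    """Point 4: number of leading columns that get data-skipping statistics."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.dataSkippingNumIndexedCols' = '{num_cols}')"
    )


if __name__ == "__main__":
    statements = [
        create_clustered_table_sql(TABLE, SCHEMA, ["txn_date", "customer_id"]),
        *maintenance_sql(TABLE),
        *archive_sql(TABLE, ARCHIVE_TABLE, "txn_date", date(2017, 7, 29)),
        stats_columns_sql(TABLE),
    ]
    for stmt in statements:
        print(stmt)  # on Databricks: spark.sql(stmt)
```

Keeping the statements in small functions makes it easy to schedule the maintenance part as a job and to adjust its frequency later without touching the table definition.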

Could you please suggest any other performance factors we need to consider for such a big table?

Regards,

Janga

szymon_dybczak · 07-29-2024

@Phani1,

Everything you've mentioned is correct. Additionally, if you have scenarios that require DELETE, UPDATE, or MERGE, you can turn on deletion vectors:

Deletion vectors are a storage optimization feature that can be enabled on Delta Lake tables. By default, when a single row in a data file is deleted, the entire Parquet file containing the record must be rewritten. With deletion vectors enabled for the table, DELETE, UPDATE, and MERGE operations use deletion vectors to mark existing rows as removed or changed without rewriting the Parquet file. Subsequent reads on the table resolve current table state by applying the deletions noted by deletion vectors to the most recent table version.

What are deletion vectors? | Databricks on AWS
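
As a minimal sketch of turning this on: deletion vectors are controlled by the `delta.enableDeletionVectors` table property. The table name below is hypothetical, and the statements are built as strings to be run with `spark.sql(...)` on a runtime that supports the feature. The second helper is an optional extra, on the assumption that you occasionally want to physically rewrite files that contain soft-deleted rows.

```python
def enable_deletion_vectors_sql(table, enabled=True):
    """ALTER TABLE statement that switches deletion vectors on or off."""
    flag = "true" if enabled else "false"
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('delta.enableDeletionVectors' = {flag})"
    )


def purge_soft_deletes_sql(table):
    """Rewrite data files so rows marked by deletion vectors are physically removed."""
    return f"REORG TABLE {table} APPLY (PURGE)"


if __name__ == "__main__":
    table = "sales.transactions"  # hypothetical table name
    print(enable_deletion_vectors_sql(table))  # on Databricks: spark.sql(...)
    print(purge_soft_deletes_sql(table))
```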

Huge Delta table performance consideration