
Huge Delta table performance considerations

Phani1
Valued Contributor II

Hi Team,

We want to create a Delta table that will hold a historical load of 10 TB of data, and we expect an incremental refresh of about 15 GB each day.

What factors should we take into account for managing such a large volume of data, especially cost- and performance-wise?

I have the below things in mind; please let me know if there's anything else we should consider.

1) We should have a good partitioning strategy in place, and we can also consider Liquid Clustering if we are unsure about the size of each partition (see the sketch after this list).
2) It's important to perform regular housekeeping tasks like vacuuming and Z-ordering every week. Depending on how well the system is performing, we can increase the frequency of these tasks to every other day.
3) Move any historical data older than 7 years into a separate table (such as an Archive table) and run transactions against the active table only. This decision should be based on business needs and on whether all the data is necessary for calculations.
4) When setting up the Delta table, prioritize the most frequently queried columns by placing them within the first 32 columns, since Delta collects file-level statistics on the first 32 columns by default.
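To make points 1–3 concrete, here is a minimal PySpark sketch. The catalog, table, and column names (main.sales.transactions, event_date, customer_id, the transactions_archive table) are illustrative assumptions, not from the original post; CREATE TABLE ... CLUSTER BY, OPTIMIZE, and VACUUM are documented Delta Lake / Databricks SQL commands (liquid clustering requires a recent Databricks Runtime).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1) Liquid clustering instead of a fixed partitioning scheme: CLUSTER BY
#    lets Delta adapt the data layout without committing to partition
#    sizes up front.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.transactions (
        txn_id      BIGINT,
        customer_id BIGINT,
        event_date  DATE,
        amount      DECIMAL(18, 2)
    )
    USING DELTA
    CLUSTER BY (event_date, customer_id)
""")

# 2) Weekly housekeeping: OPTIMIZE compacts small files and applies the
#    clustering (no ZORDER BY clause on liquid-clustered tables); VACUUM
#    removes unreferenced files past the retention window.
spark.sql("OPTIMIZE main.sales.transactions")
spark.sql("VACUUM main.sales.transactions RETAIN 168 HOURS")  # 7-day default

# 3) Archive rows older than 7 years into a separate table (assumed to
#    exist with the same schema), then remove them from the active table.
spark.sql("""
    INSERT INTO main.sales.transactions_archive
    SELECT * FROM main.sales.transactions
    WHERE event_date < date_sub(current_date(), 7 * 365)
""")
spark.sql("""
    DELETE FROM main.sales.transactions
    WHERE event_date < date_sub(current_date(), 7 * 365)
""")
```

Note that Z-ordering (point 2) and liquid clustering are mutually exclusive: on a clustered table, a plain OPTIMIZE applies the clustering, while OPTIMIZE ... ZORDER BY is only for tables without CLUSTER BY.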

Could you please suggest any other performance factors we need to consider for such a big table?

Regards,

Janga

 


szymon_dybczak
Contributor III

@Phani1 ,

All that you've mentioned is correct. Additionally, if you have a scenario that requires DELETE, UPDATE, or MERGE operations, you can turn on deletion vectors:

Deletion vectors are a storage optimization feature that can be enabled on Delta Lake tables. By default, when a single row in a data file is deleted, the entire Parquet file containing the record must be rewritten. With deletion vectors enabled for the table, DELETE, UPDATE, and MERGE operations use deletion vectors to mark existing rows as removed or changed without rewriting the Parquet file. Subsequent reads on the table resolve current table state by applying the deletions noted by deletion vectors to the most recent table version.

What are deletion vectors? | Databricks on AWS
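As a minimal sketch of how this looks in practice (reusing the illustrative table name from the sketch above; delta.enableDeletionVectors is the documented Delta table property):

```python
# Enable deletion vectors on an existing Delta table; the table name is
# illustrative, carried over from the earlier sketch.
spark.sql("""
    ALTER TABLE main.sales.transactions
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")

# Subsequent DELETE/UPDATE/MERGE operations mark rows as removed in a
# deletion vector instead of rewriting whole Parquet files:
spark.sql("DELETE FROM main.sales.transactions WHERE txn_id = 42")
```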
