Optimizing Delta Lake Performance in Databricks: Tips and Best Practices

Kaniz
Community Manager

Hello Databricks Bangalore User Group!

I hope you all are enjoying the discussions and learning opportunities within our community. I have a question that I would love to get your insights on:

In the context of Databricks, can you share your experience and best practices in optimizing the performance of Delta Lake for large-scale data processing? Specifically, what techniques or configurations have you found most effective in improving query performance and reducing the data processing time?

Your insights and experiences will be invaluable for our community members, especially those working with large-scale data in Databricks.

We look forward to hearing your thoughts and learning from your expertise!

3 REPLIES

swethaNandan
New Contributor III

Hi Kaniz,

There are a lot of techniques to improve query performance.

Some table design considerations that can improve query performance are listed below:

1. Compaction

We have seen performance issues when a table has a large number of small files, such as poor input splits or long file-listing times. The number of small files can be reduced through compaction.

There are two options for compaction: enabling auto optimize alongside your data loading jobs, or running OPTIMIZE manually. Databricks recommends keeping the data files within a partition at a more or less uniform size.

Running OPTIMIZE combines smaller files into bigger ones; the maximum file size after compaction is controlled by spark.databricks.delta.optimize.maxFileSize (default: 1 GB).
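
For example, a manual compaction run might look like this (a minimal sketch, assuming a Databricks notebook where spark is predefined; the table name events, the date filter, and the 1 GB cap are illustrative):

# Optionally cap the size of files produced by OPTIMIZE (1 GB is the default)
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 1024 * 1024 * 1024)

# Compact small files; limiting the run to recent partitions keeps it cheap
# if the table is partitioned by date
spark.sql("OPTIMIZE events WHERE event_date >= '2023-01-01'")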

Auto optimize is an optional set of features that automatically compacts small files during individual writes to a Delta table. Auto optimize adds latency overhead to write operations but accelerates read operations.
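
To turn auto optimize on for an existing table, the usual approach is to set its table properties, roughly like this (sketch; events is a placeholder table name):

# Enable optimized writes and auto compaction on a single Delta table
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

The same behaviour can also be enabled session-wide through the corresponding Spark configs (spark.databricks.delta.optimizeWrite.enabled and spark.databricks.delta.autoCompact.enabled), if you prefer not to set it per table.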

2. Data Skipping

Data skipping helps eliminate full table scans in Spark jobs by pushing down predicates and skipping data files that do not match them. File pruning happens as a metadata operation within the SCAN operator in Delta Lake. This is possible because Delta Lake automatically collects metadata about the data files it manages.

Data skipping information is collected automatically when you write data into a Delta table. Delta Lake on Databricks takes advantage of this information (minimum and maximum values) at query time to provide faster queries. You do not need to configure data skipping; the feature is activated whenever applicable. However, its effectiveness depends on the layout of your data. 

Also, collecting statistics on columns is an expensive operation, so you might need to configure delta.dataSkippingNumIndexedCols so that statistics are collected on all the columns you filter on, and avoid collecting statistics on long string columns, which can take time.
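
For instance, if the columns you filter on are all among the first few columns of the table, you could limit statistics collection like this (sketch; the value 5 is just an example):

# Collect data skipping statistics only on the first 5 columns of the table,
# so long free-text columns placed after that position are skipped
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5')
""")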

You can partition and Z-order your tables, in addition to the default data skipping Databricks provides automatically, to reduce the total number of files scanned. These are, in a way, additional data skipping mechanisms.

Dynamic file pruning is another form of data skipping: it allows file pruning both with explicit literals in filter conditions and with the less explicit predicates that are specified as part of a join, as is commonly found in most data warehouse queries.

The performance impact of dynamic file pruning is often correlated to the clustering of data so consider using Z-Ordering to maximize the benefit.
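
A minimal sketch of combining partitioning and Z-ordering (table, schema and column names are illustrative):

# Partition on a low-cardinality column that most queries filter on ...
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_date DATE,
        user_id    BIGINT,
        action     STRING
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")

# ... and cluster the files within each partition on a high-cardinality column
# that appears in filters and joins, so data skipping and dynamic file pruning
# can eliminate more files
spark.sql("OPTIMIZE events ZORDER BY (user_id)")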

Please find more details on the above, and further recommendations, in the links below:

Dataskipping - https://docs.databricks.com/delta/data-skipping.html

Dynamic file pruning - https://docs.databricks.com/optimizations/dynamic-file-pruning.html

https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-prunin...

Auto optimize - https://docs.databricks.com/optimizations/auto-optimize.html

Optimize - https://docs.databricks.com/sql/language-manual/delta-optimize.html

Please find some query-level performance tuning tips below:

Consider broadcast join hints if one of the tables in the join is smaller and will fit in memory.
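
For example (sketch; fact_sales and dim_region are placeholder tables, with dim_region being the small one):

from pyspark.sql.functions import broadcast

# DataFrame API: hint that the small dimension table should be broadcast to
# every executor, so the large fact table is not shuffled for the join
joined = spark.table("fact_sales").join(broadcast(spark.table("dim_region")), on="region_id")

# Equivalent SQL hint
joined_sql = spark.sql("""
    SELECT /*+ BROADCAST(d) */ f.*, d.region_name
    FROM fact_sales f
    JOIN dim_region d ON f.region_id = d.region_id
""")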

Make sure that table statistics are up to date by running ANALYZE TABLE ... COMPUTE STATISTICS (ref: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html).
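
For example (sketch; the table and column names are placeholders):

# Refresh table-level statistics used by the cost-based optimizer
spark.sql("ANALYZE TABLE fact_sales COMPUTE STATISTICS")

# Optionally also collect column-level statistics for common join/filter columns
spark.sql("ANALYZE TABLE fact_sales COMPUTE STATISTICS FOR COLUMNS region_id, sale_date")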

Increase the number of tasks in a stage to decrease spills: run with a lower input split size (spark.sql.files.maxPartitionBytes) so that there are more input-stage tasks (more parallelism), and increase shuffle-stage parallelism (ref: https://spark.apache.org/docs/latest/sql-performance-tuning.html).
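
A sketch of the relevant settings (the values are examples to illustrate the direction of the change, not recommendations):

# Smaller input splits -> more read-stage tasks -> more parallelism, smaller tasks
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)   # 64 MB (default 128 MB)

# More shuffle partitions -> smaller per-task shuffle data -> fewer disk spills
spark.conf.set("spark.sql.shuffle.partitions", 400)                     # default 200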

Hope these pointers helped!

Saivarshini
New Contributor II

Priyag1
Honored Contributor II

Thanks, Kaniz!
