Optimizing Delta Lake Performance in Databricks: Tips and Best Practices

Kaniz
Community Manager

Hello Databricks Bangalore User Group!

I hope you all are enjoying the discussions and learning opportunities within our community. I have a question that I would love to get your insights on:

In the context of Databricks, can you share your experience and best practices in optimizing the performance of Delta Lake for large-scale data processing? Specifically, what techniques or configurations have you found most effective in improving query performance and reducing the data processing time?

Your insights and experiences will be invaluable for our community members, especially those working with large-scale data in Databricks.

We look forward to hearing your thoughts and learning from your expertise!

3 REPLIES

swethaNandan
New Contributor III

Hi Kaniz,

There are a lot of techniques to improve query performance.

Some table design considerations that can improve query performance are listed below:

1. Compaction

We have seen performance issues when a table has a large number of small files, such as poor input splits or long file-listing times. The number of small files can be reduced through compaction.

There are two options for compaction: enabling auto optimize alongside your data loading jobs, or running OPTIMIZE manually. Databricks recommends keeping the data files within a partition at a more or less uniform size.

Running OPTIMIZE combines smaller files into bigger ones; the maximum file size after compaction is controlled by spark.databricks.delta.optimize.maxFileSize (default: 1 GB).
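
For example, a manual compaction run might look like this (a minimal sketch, assuming a Databricks notebook where spark is predefined; the table name events, the date filter, and the 1 GB cap are illustrative):

# Optionally cap the size of files produced by OPTIMIZE (1 GB is the default)
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 1024 * 1024 * 1024)

# Compact small files; limiting the run to recent partitions keeps it cheap
# if the table is partitioned by date
spark.sql("OPTIMIZE events WHERE event_date >= '2023-01-01'")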

Auto optimize is an optional set of features that automatically compacts small files during individual writes to a Delta table. Auto optimize adds latency overhead to write operations but accelerates read operations.
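
To turn auto optimize on for an existing table, the usual approach is to set its table properties, roughly like this (sketch; events is a placeholder table name):

# Enable optimized writes and auto compaction on a single Delta table
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

The same behaviour can also be enabled session-wide through the corresponding Spark configs (spark.databricks.delta.optimizeWrite.enabled and spark.databricks.delta.autoCompact.enabled), if you prefer not to set it per table.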

2. Data Skipping

Data skipping helps eliminate full table scans in Spark jobs by pushing down predicates and skipping data files that do not match them. File pruning happens as a metadata operation within the SCAN operator in Delta Lake. This is possible because Delta Lake automatically collects metadata about the data files it manages.

Data skipping information is collected automatically when you write data into a Delta table. Delta Lake on Databricks takes advantage of this information (minimum and maximum values) at query time to provide faster queries. You do not need to configure data skipping; the feature is activated whenever applicable. However, its effectiveness depends on the layout of your data. 

Also, collecting statistics on columns is an expensive operation, so you might need to configure delta.dataSkippingNumIndexedCols so that statistics are collected on all the columns you filter on, and avoid collecting statistics on long string columns, which can take time.
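
For instance, if the columns you filter on are all among the first few columns of the table, you could limit statistics collection like this (sketch; the value 5 is just an example):

# Collect data skipping statistics only on the first 5 columns of the table,
# so long free-text columns placed after that position are skipped
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5')
""")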

You can partition and Z-order your tables, in addition to the default data skipping Databricks provides automatically, to reduce the total number of files scanned. These are, in a way, additional data skipping mechanisms.

Dynamic file pruning is another form of data skipping: it allows file pruning both with explicit literals in filter conditions and with the less explicit predicates that are specified as part of a join, as is commonly found in most data warehouse queries.

The performance impact of dynamic file pruning is often correlated to the clustering of data so consider using Z-Ordering to maximize the benefit.
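
A minimal sketch of combining partitioning and Z-ordering (table, schema and column names are illustrative):

# Partition on a low-cardinality column that most queries filter on ...
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_date DATE,
        user_id    BIGINT,
        action     STRING
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")

# ... and cluster the files within each partition on a high-cardinality column
# that appears in filters and joins, so data skipping and dynamic file pruning
# can eliminate more files
spark.sql("OPTIMIZE events ZORDER BY (user_id)")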

Please find more details on the above, and further recommendations, in the links below:

Dataskipping - https://docs.databricks.com/delta/data-skipping.html

Dynamic file pruning - https://docs.databricks.com/optimizations/dynamic-file-pruning.html

https://www.databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-prunin...

Auto optimize - https://docs.databricks.com/optimizations/auto-optimize.html

Optimize - https://docs.databricks.com/sql/language-manual/delta-optimize.html

Please find some query-level performance tuning tips below:

Consider broadcast join hints if one of the tables in the join is smaller and will fit in memory.
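
For example (sketch; fact_sales and dim_region are placeholder tables, with dim_region being the small one):

from pyspark.sql.functions import broadcast

# DataFrame API: hint that the small dimension table should be broadcast to
# every executor, so the large fact table is not shuffled for the join
joined = spark.table("fact_sales").join(broadcast(spark.table("dim_region")), on="region_id")

# Equivalent SQL hint
joined_sql = spark.sql("""
    SELECT /*+ BROADCAST(d) */ f.*, d.region_name
    FROM fact_sales f
    JOIN dim_region d ON f.region_id = d.region_id
""")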

Make sure that table statistics are up to date by running ANALYZE TABLE ... COMPUTE STATISTICS (ref: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-aux-analyze-table.html).
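
For example (sketch; the table and column names are placeholders):

# Refresh table-level statistics used by the cost-based optimizer
spark.sql("ANALYZE TABLE fact_sales COMPUTE STATISTICS")

# Optionally also collect column-level statistics for common join/filter columns
spark.sql("ANALYZE TABLE fact_sales COMPUTE STATISTICS FOR COLUMNS region_id, sale_date")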

Increase the number of tasks in a stage to decrease spills: run with a lower input split size (spark.sql.files.maxPartitionBytes) so that there are more input-stage tasks (more parallelism), and increase shuffle-stage parallelism (ref: https://spark.apache.org/docs/latest/sql-performance-tuning.html).
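
A sketch of the relevant settings (the values are examples to illustrate the direction of the change, not recommendations):

# Smaller input splits -> more read-stage tasks -> more parallelism, smaller tasks
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)   # 64 MB (default 128 MB)

# More shuffle partitions -> smaller per-task shuffle data -> fewer disk spills
spark.conf.set("spark.sql.shuffle.partitions", 400)                     # default 200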

Hope these pointers helped!

Saivarshini
New Contributor II

Priyag1
Honored Contributor II

Thanks, Kaniz!
