09-21-2024 03:27 AM - edited 09-21-2024 03:31 AM
Hi everyone,
I'm relatively new to Databricks and trying to optimize some SQL queries for better performance. I've noticed that certain queries take longer to run than expected. Does anyone have tips or best practices for writing efficient SQL in Databricks? Specifically, I'm interested in how to handle large datasets and any strategies for indexing or partitioning data effectively.
09-21-2024 08:24 AM
You can find some tips here: https://community.databricks.com/t5/technical-blog/top-10-query-performance-tuning-tips-for-databric...
And here: https://www.databricks.com/discover/pages/optimize-data-workloads-guide
My overall recommendation would be to check the query profile and find which stages are taking the longest. Then you can understand whether a broadcast join would help, or repartitioning, or another strategy.
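As a quick illustration of the broadcast strategy mentioned above, here is a minimal sketch in Databricks SQL. The table names (`sales`, `regions`) and columns are hypothetical placeholders for a large fact table joined to a small lookup table:

```sql
-- Hypothetical tables: `sales` is large, `regions` is a small lookup table.
-- The BROADCAST hint ships the small table to every executor,
-- avoiding a shuffle of the large one.
SELECT /*+ BROADCAST(r) */
  s.order_id,
  r.region_name
FROM sales AS s
JOIN regions AS r
  ON s.region_id = r.region_id;
```

You can verify in the query profile that the join became a broadcast hash join instead of a sort-merge join with a shuffle.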
09-22-2024 09:30 AM
Hi @alexacas ,
The best thing is to share the queries and table structures 🙂
But my general approach is:
1. Use partitioning/Z-ordering, or if you can upgrade the runtime to 15.4, use liquid clustering, which is the newer optimization technique.
2. Make sure you do not have many small files. Run DESCRIBE DETAIL on your tables to check whether the files are around 128 MB. If they are not, set up a maintenance job that runs OPTIMIZE on your tables on a regular basis.
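The two checks above can be sketched in Databricks SQL roughly as follows. The table and column names are hypothetical, and the liquid clustering statement assumes a Delta table on a runtime that supports it:

```sql
-- Inspect file layout: the output includes numFiles and sizeInBytes,
-- from which you can estimate the average file size.
DESCRIBE DETAIL my_schema.my_table;

-- Compact small files; typically run from a scheduled maintenance job.
OPTIMIZE my_schema.my_table;

-- On runtime 15.4+, liquid clustering can replace partitioning/Z-ordering
-- (event_date is a hypothetical frequently filtered column):
ALTER TABLE my_schema.my_table
  CLUSTER BY (event_date);
```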
04-17-2025 12:14 AM
When working with large datasets in Databricks SQL, here are some practical tips to boost performance:
Leverage Partitioning: Partition large Delta tables on low-cardinality columns that you filter on frequently (like date or region). It helps Databricks skip irrelevant data during reads; partitioning on high-cardinality columns creates too many small partitions and hurts performance.
Avoid SELECT *: Be explicit with the columns you need; pulling only what you use reduces I/O and speeds things up.
Use Delta Lake: If you're not already, use the Delta format; it supports efficient updates, ACID transactions, and optimization features like OPTIMIZE and ZORDER.
Broadcast Joins: For small lookup tables, use broadcast joins (broadcast(table)) to avoid shuffling huge datasets.
Caching Smartly: Cache intermediate results only when reused multiple times, and always clear when no longer needed to free up memory.
Analyze & Optimize: Use EXPLAIN to understand the query plan and OPTIMIZE with ZORDER BY on frequently filtered columns for faster retrieval.
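The last two tips in the list can be sketched together in Databricks SQL. The `events` table and `event_date` column are hypothetical examples of a table with a frequently filtered column:

```sql
-- Inspect the physical plan to see which scans, joins,
-- and shuffles dominate the query.
EXPLAIN
SELECT order_id, amount
FROM events
WHERE event_date = '2024-09-01';

-- Co-locate rows on the frequently filtered column so data
-- skipping can prune files during the scan above.
OPTIMIZE events
ZORDER BY (event_date);
```

Note that ZORDER applies to classic Delta tables; tables using liquid clustering are maintained with plain OPTIMIZE instead.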
Hope that helps! Let me know what kind of queries or data you're working with; I can offer more tailored tips.