09-21-2024 03:27 AM - edited 09-21-2024 03:31 AM
Hi everyone,
I’m relatively new to Databricks and trying to optimize some SQL queries for better performance. I’ve noticed that certain queries take longer to run than expected. Does anyone have tips or best practices for writing efficient SQL in Databricks? Specifically, I’m interested in how to handle large datasets and any strategies for indexing or partitioning data effectively.
Accepted Solutions
09-21-2024 08:24 AM
You can find some tips here: https://community.databricks.com/t5/technical-blog/top-10-query-performance-tuning-tips-for-databric...
And here: https://www.databricks.com/discover/pages/optimize-data-workloads-guide
My overall recommendation would be to check the query profile and find which steps are taking the longest. Then you can understand whether a broadcast join, repartitioning, or another strategy would help.
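For example, if the profile shows a large shuffle on a join with a small dimension table, a broadcast hint is a quick experiment to try. A minimal Databricks SQL sketch, assuming hypothetical fact_sales and dim_region tables:

```sql
-- Inspect the physical plan to see which join strategy Spark chose
EXPLAIN FORMATTED
SELECT f.region_id, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_region r ON f.region_id = r.region_id
GROUP BY f.region_id;

-- If the small dimension table is being shuffled, hint a broadcast join
SELECT /*+ BROADCAST(r) */ f.region_id, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_region r ON f.region_id = r.region_id
GROUP BY f.region_id;
```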
09-22-2024 09:30 AM
Hi @alexacas ,
The best thing is to share the queries and table structures 🙂
But my general approach is:
1. Use partitioning/Z-Ordering, or if you can upgrade to Databricks Runtime 15.4, use liquid clustering, which is the newer optimization technique.
2. Make sure you do not have many small files. Run DESCRIBE DETAIL on your tables to check whether the files are around 128 MB. If they are not, set up a maintenance job that runs OPTIMIZE on your tables on a regular basis (see the sketch after this list).
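A minimal sketch of both steps, assuming a hypothetical sales table (CLUSTER BY needs a runtime that supports liquid clustering):

```sql
-- 1. Create the table with liquid clustering instead of static partitions
CREATE TABLE sales (
  sale_date   DATE,
  customer_id BIGINT,
  amount      DECIMAL(10, 2)
)
CLUSTER BY (sale_date, customer_id);

-- 2. Check the file layout: compare numFiles with sizeInBytes
--    to see whether files average roughly 128 MB
DESCRIBE DETAIL sales;

-- Compact small files (on a CLUSTER BY table this also clusters the data);
-- schedule this as a regular maintenance job
OPTIMIZE sales;
```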
Thursday
When working with large datasets in Databricks SQL, here are some practical tips to boost performance:
Leverage Partitioning: Partition large Delta tables on low-cardinality columns that are frequently filtered on (like date or region). It helps Databricks skip irrelevant data during reads.
Avoid SELECT *: Be explicit with the columns you need; pulling only what you use reduces I/O and speeds things up.
Use Delta Lake: If you’re not already, use Delta format—it supports efficient updates, ACID transactions, and optimization features like OPTIMIZE and ZORDER.
Broadcast Joins: For small lookup tables, use broadcast joins (the broadcast() function in PySpark or a BROADCAST hint in SQL) to avoid shuffling huge datasets.
Caching Smartly: Cache intermediate results only when reused multiple times, and always clear when no longer needed to free up memory.
Analyze & Optimize: Use EXPLAIN to understand the query plan, and OPTIMIZE with ZORDER BY on frequently filtered columns for faster retrieval (see the combined sketch below).
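To make a few of these concrete, here is a minimal Databricks SQL sketch; the events table, its columns, and the literal values are hypothetical:

```sql
-- Partition a Delta table on a low-cardinality, frequently filtered column
CREATE TABLE events (
  event_date DATE,
  country    STRING,
  user_id    BIGINT,
  payload    STRING
)
USING DELTA
PARTITIONED BY (event_date);

-- Project only the needed columns and filter on the partition column,
-- so irrelevant files are skipped entirely
SELECT user_id, country
FROM events
WHERE event_date = '2024-09-01';

-- Cache a reused intermediate result, and release it when finished
CACHE TABLE recent_events AS
SELECT user_id, country FROM events WHERE event_date >= '2024-09-01';
UNCACHE TABLE recent_events;

-- Inspect the plan, then compact files and co-locate rows
-- on a hot, high-cardinality filter column
EXPLAIN SELECT country FROM events WHERE user_id = 42;
OPTIMIZE events ZORDER BY (user_id);
```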
Hope that helps! Let me know what kind of queries or data you’re working with—I can offer more tailored tips.

