Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Help with Databricks SQL Queries

alexacas
New Contributor II

Hi everyone,

I’m relatively new to Databricks and trying to optimize some SQL queries for better performance. I’ve noticed that certain queries take longer to run than expected. Does anyone have tips or best practices for writing efficient SQL in Databricks? Specifically, I’m interested in how to handle large datasets and any strategies for indexing or partitioning data effectively.


1 ACCEPTED SOLUTION

mhiltner
Databricks Employee

You can find some tips here: https://community.databricks.com/t5/technical-blog/top-10-query-performance-tuning-tips-for-databric... 

And here: https://www.databricks.com/discover/pages/optimize-data-workloads-guide 

My overall recommendation would be to check the query performance window and find which processes are taking the longest. Then you can understand whether a broadcast would help, or repartitioning, or any other strategy.
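A minimal sketch of that first step, assuming hypothetical tables named `orders` and `customers`:

```sql
-- Inspect the physical plan before running the full query.
-- A BroadcastHashJoin node means the small side is being broadcast;
-- a SortMergeJoin preceded by large shuffles is often the slow path
-- worth attacking with a broadcast hint or repartitioning.
EXPLAIN FORMATTED
SELECT o.order_id, c.region
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```

The query profile UI shows the same stages with actual runtimes, so the plan and the profile together point at where the time goes.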


3 REPLIES


filipniziol
Esteemed Contributor

Hi @alexacas ,

The best thing is to share the queries and table structures 🙂

But my general approach is:
1. Use partitioning/Z-ordering, or if you can upgrade the runtime to 15.4, use liquid clustering, which is the newer optimization technique.

2. Make sure you do not have many small files. Run DESCRIBE DETAIL on your tables to check whether the files are around 128 MB. If they are not, set up a maintenance job that runs OPTIMIZE on your tables on a regular basis.
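A minimal sketch of both points, assuming a hypothetical Delta table named `sales` with columns `order_date` and `region`:

```sql
-- 1. Liquid clustering (newer runtimes): replaces partitioning/Z-ordering
--    for the chosen columns; no need to pick partition granularity up front.
ALTER TABLE sales CLUSTER BY (order_date, region);

-- 2. Check file health: divide sizeInBytes by numFiles in the output
--    to estimate the average file size (target roughly 128 MB).
DESCRIBE DETAIL sales;

-- Compact small files; schedule this as a regular maintenance job.
OPTIMIZE sales;
```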


lowedirect
New Contributor

When working with large datasets in Databricks SQL, here are some practical tips to boost performance:

  1. Leverage Partitioning: Partition large Delta tables on low-cardinality columns that are frequently filtered on (like date or region). It helps Databricks skip irrelevant data during reads.

  2. Avoid SELECT *: Be explicit with the columns you need; pulling only what you use reduces I/O and speeds things up.

  3. Use Delta Lake: If you’re not already, use Delta format—it supports efficient updates, ACID transactions, and optimization features like OPTIMIZE and ZORDER.

  4. Broadcast Joins: For small lookup tables, use broadcast joins (the /*+ BROADCAST(table) */ hint in SQL, or broadcast(df) in PySpark) to avoid shuffling huge datasets.

  5. Cache Smartly: Cache intermediate results only when they are reused multiple times, and clear the cache when it is no longer needed to free up memory.

  6. Analyze & Optimize: Use EXPLAIN to understand the query plan and OPTIMIZE with ZORDER BY on frequently filtered columns for faster retrieval.
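Tips 4 and 6 can be sketched as follows, assuming hypothetical tables `sales` and `dim_region`:

```sql
-- Broadcast the small dimension table so the large fact table
-- is not shuffled across the cluster (tip 4).
SELECT /*+ BROADCAST(d) */ f.sale_id, d.region_name
FROM sales f
JOIN dim_region d ON f.region_id = d.region_id;

-- Co-locate rows on a frequently filtered column so data skipping
-- can prune files at read time (tip 6).
OPTIMIZE sales ZORDER BY (sale_date);
```

Note that ZORDER BY and liquid clustering (CLUSTER BY) are alternatives: pick one per table.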

Hope that helps! Let me know what kind of queries or data you’re working with—I can offer more tailored tips.
