09-21-2024 03:27 AM - edited 09-21-2024 03:31 AM
Hi everyone,
I’m relatively new to Databricks and trying to optimize some SQL queries for better performance. I’ve noticed that certain queries take longer to run than expected. Does anyone have tips or best practices for writing efficient SQL in Databricks? Specifically, I’m interested in how to handle large datasets and any strategies for indexing or partitioning data effectively.
Accepted Solutions
09-21-2024 08:24 AM
You can find some tips here: https://community.databricks.com/t5/technical-blog/top-10-query-performance-tuning-tips-for-databric...
And here: https://www.databricks.com/discover/pages/optimize-data-workloads-guide
My overall recommendation would be to check the query profile and find which steps are taking the longest. Then you can understand whether a broadcast join, repartitioning, or another strategy would help.
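For example, if the profile shows a large shuffle on a join with a small dimension table, a broadcast hint is a quick experiment to try. A minimal Databricks SQL sketch, assuming hypothetical fact_sales and dim_region tables:

```sql
-- Inspect the physical plan to see which join strategy Spark chose
EXPLAIN FORMATTED
SELECT f.region_id, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_region r ON f.region_id = r.region_id
GROUP BY f.region_id;

-- If the small dimension table is being shuffled, hint a broadcast join
SELECT /*+ BROADCAST(r) */ f.region_id, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_region r ON f.region_id = r.region_id
GROUP BY f.region_id;
```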
09-22-2024 09:30 AM
Hi @alexacas ,
The best thing is to share the queries and table structures 🙂
But my general approach is:
1. Use partitioning/Z-Ordering, or if you can upgrade to Databricks Runtime 15.4, use liquid clustering, which is the newer optimization technique.
2. Make sure you do not have many small files. Run DESCRIBE DETAIL on your tables to check whether the files are around 128 MB. If they are not, set up a maintenance job that runs OPTIMIZE on your tables on a regular basis (see the sketch after this list).
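A minimal sketch of both steps, assuming a hypothetical sales table (CLUSTER BY needs a runtime that supports liquid clustering):

```sql
-- 1. Create the table with liquid clustering instead of static partitions
CREATE TABLE sales (
  sale_date   DATE,
  customer_id BIGINT,
  amount      DECIMAL(10, 2)
)
CLUSTER BY (sale_date, customer_id);

-- 2. Check the file layout: compare numFiles with sizeInBytes
--    to see whether files average roughly 128 MB
DESCRIBE DETAIL sales;

-- Compact small files (on a CLUSTER BY table this also clusters the data);
-- schedule this as a regular maintenance job
OPTIMIZE sales;
```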
Thursday
When working with large datasets in Databricks SQL, here are some practical tips to boost performance:
Leverage Partitioning: Partition large Delta tables on low-cardinality columns that are frequently filtered on (like date or region). It helps Databricks skip irrelevant data during reads.
Avoid SELECT *: Be explicit with the columns you need; pulling only what you use reduces I/O and speeds things up.
Use Delta Lake: If you’re not already, use Delta format—it supports efficient updates, ACID transactions, and optimization features like OPTIMIZE and ZORDER.
Broadcast Joins: For small lookup tables, use broadcast joins (the broadcast() function in PySpark or a BROADCAST hint in SQL) to avoid shuffling huge datasets.
Caching Smartly: Cache intermediate results only when reused multiple times, and always clear when no longer needed to free up memory.
Analyze & Optimize: Use EXPLAIN to understand the query plan, and OPTIMIZE with ZORDER BY on frequently filtered columns for faster retrieval (see the combined sketch below).
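To make a few of these concrete, here is a minimal Databricks SQL sketch; the events table, its columns, and the literal values are hypothetical:

```sql
-- Partition a Delta table on a low-cardinality, frequently filtered column
CREATE TABLE events (
  event_date DATE,
  country    STRING,
  user_id    BIGINT,
  payload    STRING
)
USING DELTA
PARTITIONED BY (event_date);

-- Project only the needed columns and filter on the partition column,
-- so irrelevant files are skipped entirely
SELECT user_id, country
FROM events
WHERE event_date = '2024-09-01';

-- Cache a reused intermediate result, and release it when finished
CACHE TABLE recent_events AS
SELECT user_id, country FROM events WHERE event_date >= '2024-09-01';
UNCACHE TABLE recent_events;

-- Inspect the plan, then compact files and co-locate rows
-- on a hot, high-cardinality filter column
EXPLAIN SELECT country FROM events WHERE user_id = 42;
OPTIMIZE events ZORDER BY (user_id);
```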
Hope that helps! Let me know what kind of queries or data you’re working with—I can offer more tailored tips.

