
Databricks Optimization Tips – What’s Your Secret?

shraddha_09
New Contributor II

When I first started working with Databricks, I was genuinely impressed by its potential. The seamless integration with Delta Lake, the power of PySpark, and the ability to process massive datasets at incredible speeds—it was truly impactful.

Over time, I’ve picked up a few optimization techniques that made a real difference:

  • Caching data at the right stages: Strategically caching DataFrames helped reduce redundant computation and cut execution times.

  • Broadcast joins for smaller datasets: Using broadcast() during joins with smaller tables significantly reduced shuffle times.

  • Avoiding UDFs unless absolutely necessary: I realized that native PySpark functions are almost always faster. UDFs should be a last resort.

  • Optimizing file formats: Switching from CSV to Parquet cut down processing time and storage costs—an absolute lifesaver.

  • Partitioning for faster reads: Properly partitioning data not only improved read times but also made querying far more efficient.
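
To make the first three points concrete, here is a minimal PySpark sketch; the table names, join key, and columns are hypothetical placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables: a large claims fact table and a small region lookup.
claims = spark.table("raw.claims")
regions = spark.table("raw.regions")

# Broadcast the small table so the join avoids shuffling the large side.
enriched = claims.join(F.broadcast(regions), on="region_id", how="left")

# Prefer native functions to Python UDFs, e.g. for simple string cleanup.
enriched = enriched.withColumn("region_name", F.upper(F.trim("region_name")))

# Cache only because this result is reused by more than one action below.
enriched.cache()

print(enriched.count())                          # first action populates the cache
enriched.groupBy("region_name").count().show()   # reuses the cached data
```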

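And for the file-format and partitioning points, a small sketch of a one-off conversion to partitioned Parquet; the paths and the partition column are again just placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Convert a hypothetical CSV source to Parquet, partitioned by a column that
# downstream queries usually filter on.
claims = spark.read.option("header", True).csv("/data/claims.csv")

(claims.write
    .mode("overwrite")
    .partitionBy("claim_year")
    .parquet("/lake/claims_parquet"))

# Later reads prune down to just the partitions they need.
recent = spark.read.parquet("/lake/claims_parquet").where("claim_year = '2024'")
```
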
One of the biggest impacts I saw was during a project for an insurance client where optimizing PySpark scripts and enhancing SQL queries reduced processing times by nearly 40%. That kind of improvement is not just about faster jobs; it’s about freeing up resources for more innovative work.

I’d love to hear from the community—what are your best tips for getting the most out of Databricks? Are there any hidden optimizations you’ve discovered along the way? Let’s share our best practices and learn from each other!

1 REPLY

chanukya-pekala
Contributor

1. Try to remove cache() and persist() calls from the DataFrame operations in your code base.

2. Avoid driver-side operations like collect() and take() wherever possible - they bring data from the executors back to the driver, which adds heavy network I/O overhead.

3. Avoid partitioning small and medium-sized datasets - Databricks handles these well once you enable automatic liquid clustering.

4. Avoid driver-based pandas code in a Spark codebase - it turns the work into a driver-only operation and leaves distributed processing underutilized. Convert to Spark DataFrames and use the full power of the executors (see the first sketch after this list).

5. Huge DAGs - persist intermediate steps to temp tables so the data is materialized. The huge DAG no longer has to carry everything in memory; instead you get several tiny DAGs, which makes the code easier to debug and the runs smoother, with less overhead (see the second sketch after this list).

6. Avoid Photon by default - you don't need Photon for every query, and it is considerably more expensive than a plain all-purpose cluster. SQL warehouses and serverless compute use the Photon engine by default, but for notebook work on all-purpose compute you most likely don't need it.

7. Avoid DLT where a plain job will do - its pricing is high compared to running the same stream-stream join on all-purpose or job compute. Build the pipeline in DLT, get it working, and once the data matches expectations, replicate the setup as a scheduled Job. It does essentially the same thing at a much lower cost, and you keep control over checkpoints and failure handling.
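
On point 4, a minimal sketch of moving driver-bound pandas logic onto a Spark DataFrame; the file paths, columns, and conversion rate are placeholders:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Driver-bound version: the whole file is processed by pandas on the driver.
pdf = pd.read_csv("/dbfs/data/orders.csv")
pdf["amount_eur"] = pdf["amount_usd"] * 0.9   # illustrative conversion rate

# Distributed version: the same logic on a Spark DataFrame, so the work
# runs on the executors instead of the driver.
orders = spark.read.option("header", True).csv("/data/orders.csv")
orders = orders.withColumn("amount_eur", F.col("amount_usd").cast("double") * 0.9)
```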

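And on point 5, a sketch of materializing an intermediate step to a temp table to cut a huge lineage; the schema and table names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A long chain of transformations builds one huge lineage. Materializing an
# intermediate result cuts that lineage, so downstream steps read the stored
# data instead of carrying the whole plan around.
intermediate = (spark.table("bronze.events")                 # hypothetical source
                .filter(F.col("event_date") >= "2024-01-01")
                .groupBy("user_id")
                .agg(F.count("*").alias("event_count")))

intermediate.write.mode("overwrite").saveAsTable("tmp.user_event_counts")

# Continue from the materialized table with a fresh, much smaller DAG.
scored = (spark.table("tmp.user_event_counts")
          .withColumn("is_active", F.col("event_count") > 10))
```
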
Chanukya
