9 Powerful 🚀 Spark Optimization Techniques in Databricks (With Real Examples)

savlahanish27
New Contributor II

📘 Introduction

One of our ETL pipelines used to take 10 hours to complete. After tuning and scaling in Databricks, it finished in just about 1 hour — a 90% reduction in runtime.

That’s the power of Spark tuning.

Databricks, built on Apache Spark, is a powerful platform for big data, machine learning, and real-time analytics. But without the right optimizations, Spark jobs can quickly become slow, expensive, and hard to scale.

In this guide, we explore 9 proven optimization techniques for Databricks Spark — from autoscaling clusters and smart partitioning to Delta Lake tuning and adaptive execution.

Whether you’re running:

  •  ETL pipelines
  • 🤖 Machine learning models
  • 📊 Real-time analytics

These techniques will help you:

  • Speed up queries and transformations
  • Reduce cloud costs significantly
  • Build more scalable and reliable pipelines

Backed by real-world datasets (hundreds of millions of rows, up to 500TB in volume), these techniques have delivered 5×–10× speedups in production pipelines while cutting costs significantly.

 

 

💡 These 9 techniques together can make Spark pipelines run 5–10× faster and cut cloud costs by 30%+.

 

1. Cluster & Resource Optimization

Why It Matters:
The compute cluster is the engine that runs your Spark jobs. Misconfigured clusters (too small, too large, wrong node types) can result in slow jobs or high costs.

Best Practices:

  • Use Autoscaling clusters to handle variable workloads without overprovisioning.
  • Enable Photon runtime for SQL and Delta acceleration.
  • Use cluster pools to reduce start-up time.
  • Use job clusters for production workloads, and all-purpose clusters for notebooks.

⚠️ Pitfalls to Avoid:

  • Overprovisioning → unnecessarily high cloud costs.
  • Not enabling Photon when workloads are SQL/Delta heavy.

Example: Configuring an Autoscaling Job Cluster

{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 10
  },
  "node_type_id": "Standard_DS3_v2",
  "driver_node_type_id": "Standard_DS3_v2",
  "runtime_engine": "PHOTON"
}

A transaction aggregation job that initially took 40 minutes on a fixed 2-node cluster completed in just 12 minutes after enabling autoscaling with Photon.
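
To cut cold-start time further, the same spec can draw workers from a cluster pool; a minimal sketch, assuming a pre-created pool (the pool ID below is a placeholder, and worker node types are inherited from the pool):

{
  "instance_pool_id": "<your-pool-id>",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 10
  },
  "runtime_engine": "PHOTON"
}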

 

2. Partitioning Strategy

Why It Matters:
Efficient data partitioning improves parallelism, reduces I/O, and speeds up queries. Without partitioning, Spark may scan entire datasets unnecessarily.

Best Practices:

  • Partition by columns frequently used in filters (e.g., transaction_date, region).
  • Avoid over-partitioning, which leads to small files and overhead.
  • Repartition large DataFrames before expensive operations like joins or writes.

⚠️ Pitfalls to Avoid:

  • Too many partitions → metadata overhead + small files problem.
  • Partitioning on low-cardinality columns (e.g., gender) → no benefit.

Example: Partitioning Transaction Data by Date

df.write.format("delta") \
.partitionBy("year", "month", "day") \
.mode("overwrite") \
.save("/mnt/delta/transactions")

Queries like:

SELECT * FROM transactions WHERE year = 2025 AND month = 8

now only scan a small portion of the data instead of the entire dataset.
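
Separately from on-disk partitioning, the in-memory repartition mentioned in the best practices above can be sketched as follows; the partition count and key column are illustrative and should match your data and cluster size:

# Repartition on the join key before a wide operation so work is spread evenly
df = spark.read.format("delta").load("/mnt/delta/transactions")
df = df.repartition(200, "customer_id")
result = df.groupBy("customer_id").count()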

 

3. Data Caching & Persistence

Why It Matters:
Recomputing DataFrames in memory-intensive jobs can be expensive. Caching avoids repeated reads from storage, accelerating interactive and iterative workloads.

Best Practices:

  • Use .cache() for DataFrames reused multiple times in memory.
  • Use .persist(StorageLevel.DISK_ONLY) if memory is limited.
  • Trigger caching with an action (e.g., count()).

⚠️ Pitfalls to Avoid:

  • Caching very large datasets without enough memory → job failures.
  • Forgetting to unpersist() unused cached DataFrames → memory leaks.

Example: Caching a Reused Dataset

df = spark.read.format("delta").load("/mnt/delta/transactions")
df.cache()
df.count() # Action triggers cache

Training a machine learning model on cached data reduced iteration time by 70%.
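
If memory is limited, the persist/unpersist pattern from the best practices above looks roughly like this:

from pyspark import StorageLevel

df = spark.read.format("delta").load("/mnt/delta/transactions")
df.persist(StorageLevel.DISK_ONLY)   # keep blocks on disk rather than in memory
df.count()                           # action materializes the persisted data
# ... reuse df across several transformations ...
df.unpersist()                       # release the storage when done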

 

4. Data Compression & File Formats

Why It Matters:
The format and compression of your files affect both storage costs and I/O performance. CSVs are large and inefficient; Delta and Parquet are optimized for Spark.

Best Practices:

  • Always store data in Delta or Parquet; avoid CSVs in production.
  • Use Snappy or ZSTD compression.
  • Use columnar formats for efficient reads.

⚠️ Pitfalls to Avoid:

  • Using CSV/JSON in production → huge storage + slow reads.
  • Over-compression (e.g., GZIP) → smaller files but slower decompression.

Example: Writing Delta with ZSTD Compression

df.write.format("delta") \
.option("compression", "zstd") \
.save("/mnt/delta/transactions")

A 500 GB CSV dataset shrank to roughly 150 GB once stored as Delta with ZSTD, and query performance improved substantially.
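
A one-time migration of such a CSV dataset into compressed Delta might look like the following sketch; the paths are illustrative and the compression option mirrors the write shown above:

# Read the legacy CSV once, then rewrite it as ZSTD-compressed Delta
df = spark.read.option("header", "true").csv("/mnt/raw/transactions_csv")
df.write.format("delta") \
    .option("compression", "zstd") \
    .mode("overwrite") \
    .save("/mnt/delta/transactions")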

 

5. Delta Lake Optimization

Why It Matters:
Delta Lake enables ACID transactions and scalable data lakes. Over time, frequent updates can create many small files, which slow down queries.

Best Practices:

  • Use OPTIMIZE to compact small files.
  • Use ZORDER BY for faster queries on filter columns.
  • Run VACUUM regularly to clean up old files.

⚠️ Pitfalls to Avoid:

  • Not optimizing → file sprawl and degraded performance.
  • Running VACUUM with too aggressive retention → accidental data loss.

Example: Optimizing and Z-Ordering

OPTIMIZE transactions ZORDER BY (customer_id);
VACUUM transactions RETAIN 168 HOURS;

A customer lookup query improved from 20 minutes → 3 minutes after file compaction and Z-Ordering.
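
To keep small files from piling up between manual OPTIMIZE runs, Databricks also supports auto-compaction table properties; a sketch on the same table, issued from Python here (plain SQL works just as well):

spark.sql("""
    ALTER TABLE transactions SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")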

 

6. Join Optimization with Broadcast Joins

Why It Matters:
Standard joins shuffle large amounts of data. If one table is small, broadcasting it can eliminate shuffle and speed up the job significantly.

Best Practices:

  • Broadcast tables smaller than 10MB.
  • Use broadcast() in PySpark for explicit joins.
  • Tune spark.sql.autoBroadcastJoinThreshold.

⚠️ Pitfalls to Avoid:

  • Broadcasting too-large tables → OutOfMemory errors.
  • Forcing broadcast when Spark’s optimizer would do better.

Example: Broadcasting a Small Dimension Table

from pyspark.sql.functions import broadcast

df_large = spark.read.parquet("/mnt/delta/transactions")
df_small = spark.read.parquet("/mnt/delta/customers")
df_result = df_large.join(broadcast(df_small), "customer_id")

Join runtime reduced from 25 minutes → 6 minutes.
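
The broadcast threshold mentioned in the best practices is controlled by a session conf; the 50 MB value below is only an example, and -1 turns automatic broadcasting off entirely:

# Raise the automatic broadcast cutoff from the 10 MB default to ~50 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

# Or disable it and rely on explicit broadcast() hints
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")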

 

7. Skew & Shuffle Optimization

Why It Matters:
Skewed data (e.g., one customer with millions of transactions) causes uneven task distribution, leading to long-running or failed jobs.

Best Practices:

  • Use salting to spread skewed keys.
  • Tune spark.sql.shuffle.partitions.
  • Avoid wide transformations before filtering.

⚠️ Pitfalls to Avoid:

  • Not adjusting shuffle partitions → too many or too few tasks.
  • Adding salt without consistent logic → broken joins.

Example: Skew Fix with Salting

from pyspark.sql.functions import rand, explode, array, lit

N = 10  # number of salt buckets
# Skewed (large) side: assign each row a random salt in [0, N)
df1 = df1.withColumn("salt", (rand() * N).cast("int"))
# Other side: replicate each row across all N salt values so every key still matches
df2 = df2.withColumn("salt", explode(array(*[lit(i) for i in range(N)])))
df_joined = df1.join(df2, ["customer_id", "salt"])

This approach distributed skewed keys evenly and improved performance dramatically.
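
Tuning spark.sql.shuffle.partitions, also listed above, is a single conf change; pick a value in line with your data volume and core count (200 is the default):

# Default is 200; raise it for very large shuffles, lower it for small jobs
spark.conf.set("spark.sql.shuffle.partitions", "400")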

 

8. Structured Streaming Optimization

Why It Matters:
Streaming workloads need consistent low-latency processing. Without proper configuration, ingestion delays and data loss may occur.

Best Practices:

  • Use Auto Loader for scalable ingestion.
  • Always configure checkpointing.
  • Write to Delta Lake for upserts and compaction.

⚠️ Pitfalls to Avoid:

  • Not configuring checkpoints → risk of duplicate/missing data.
  • Using JSON/CSV for streaming input → poor performance.

Example: Streaming with Auto Loader and Checkpoint

# Auto Loader infers the schema; cloudFiles.schemaLocation is where it stores the inferred schema
df_stream = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "parquet") \
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/transactions") \
    .load("/mnt/raw/transactions")

df_stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/transactions") \
    .start("/mnt/delta/transactions")

Latency reduced from 30 minutes (batch) → under 5 minutes with streaming.
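
To keep latency predictable, the write can also run on a fixed trigger interval, and Auto Loader's cloudFiles.maxFilesPerTrigger option can cap how much each micro-batch ingests; a sketch with an illustrative one-minute trigger:

# Process a micro-batch every minute instead of as fast as files arrive
df_stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/transactions") \
    .trigger(processingTime="1 minute") \
    .start("/mnt/delta/transactions")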

 

9. Query Tuning & Adaptive Execution (AQE)

Why It Matters:
Static query plans can’t adapt to data size variations. AQE dynamically adjusts plans at runtime for better efficiency.

Best Practices:

  • Enable AQE for all workloads.
  • Let Spark decide join strategies & partition coalescing.
  • Combine AQE with Z-Ordering + partitioning.

⚠️ Pitfalls to Avoid:

  • Forgetting to enable AQE → static inefficient plans.
  • Overriding AQE decisions with manual hints.

Example: Enabling AQE in Spark

spark.conf.set("spark.sql.adaptive.enabled", "true")

A poorly optimized join was reduced from 8 minutes → 2 minutes by enabling AQE.
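
AQE's individual features can also be toggled explicitly; these are standard Spark 3.x settings (and are typically enabled by default on recent Databricks runtimes):

spark.conf.set("spark.sql.adaptive.enabled", "true")
# Merge small shuffle partitions into fewer, larger ones at runtime
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Automatically split skewed partitions during sort-merge joins
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")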

 

Conclusion

Databricks offers unmatched power and flexibility — but with great power comes great responsibility.

By applying these techniques:

  • Query times can improve by 5× to 10×
  • Costs can drop significantly via autoscaling & compression
  • Reliability increases with partitioning & caching
  • Streaming pipelines deliver near real-time analytics

These optimization methods are production-proven and scalable, helping you get the most value out of your Spark infrastructure on Databricks.

1 REPLY

Advika
Databricks Employee

This is a fantastic breakdown of Spark optimization techniques, @savlahanish27!
Definitely helpful for anyone working on performance tuning in Databricks.
