When I think about optimising ETL on the Databricks Lakehouse, I split it into four layers: data layout, Spark/SQL design, platform configuration, and operational excellence.
And above all: you are not building pipelines for yourself; you are building them for the business to consume. Keeping this in mind changes how you design for reliability, consistency, and performance.
1. Start with the right architecture: medallion + incremental
Avoid building a single, complex pipeline that tries to do everything.
Medallion layers
- Bronze: raw, ingested as-is
- Silver: cleaned and conformed
- Gold: business-ready marts and aggregates
This keeps each step simpler, easier to test, and cheaper to rerun.
Always design for incremental processing
- Use timestamp columns (e.g., ingestion_date, updated_at).
- With Lakeflow declarative pipelines (formerly DLT), prefer the built-in CDC capabilities instead of full reloads.
- For batch pipelines, filter incrementally and use merge patterns instead of overwriting large datasets.
This reduces data volume processed per run, shortens execution times, and reduces compute cost.
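As a minimal sketch of the incremental-plus-merge pattern, the snippet below filters new source rows by a watermark and upserts them into the target with a Delta MERGE. The table names, the `updated_at` watermark column, and the `order_id` key are illustrative assumptions, and `spark` is the session available in a Databricks notebook.

```python
# Sketch: incremental batch upsert with a watermark filter and MERGE.
# Table names, the updated_at column, and the order_id key are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Only read source rows that arrived since the last successful run.
last_watermark = "2024-01-01T00:00:00"  # in practice, load this from a control table
incremental_df = (
    spark.table("bronze.orders")
         .filter(F.col("updated_at") > F.lit(last_watermark))
)

# Upsert into the Silver table instead of overwriting it.
target = DeltaTable.forName(spark, "silver.orders")
(
    target.alias("t")
          .merge(incremental_df.alias("s"), "t.order_id = s.order_id")
          .whenMatchedUpdateAll()
          .whenNotMatchedInsertAll()
          .execute()
)
```

In practice the watermark would be read from, and written back to, a small control table (or derived from the target) so each run only processes what the previous run has not.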
2. Optimise storage & Delta layout (often the biggest impact area)
Most slow pipelines are ultimately storage-layout issues.
Use Delta Lake by default
It gives you ACID transactions, schema enforcement, time travel, and optimisations such as OPTIMIZE, data skipping, and vacuuming.
Right-size your files
Small files slow everything down.
Use table optimisation to compact files into appropriate sizes (often 128–512 MB depending on usage).
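As an example of what compaction can look like in practice (the table name and target size are placeholders, and `delta.targetFileSize` is a Databricks-specific table property worth verifying for your runtime):

```python
# Sketch: compacting small files on a Delta table; the table name is a placeholder.
spark.sql("OPTIMIZE silver.orders")

# Optionally steer future file sizes toward a target that matches the workload
# (assumption: ~256 MB suits this table's query patterns).
spark.sql("""
    ALTER TABLE silver.orders
    SET TBLPROPERTIES ('delta.targetFileSize' = '256mb')
""")
```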
Partitioning strategy matters
- Partition on low-cardinality, commonly-filtered columns (e.g., date, region).
- Avoid partitioning on high-cardinality attributes (e.g., user_id).
- For more complex filtering, use multi-dimensional clustering techniques instead of excessive partitioning.
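To make this concrete, here is a hedged sketch: a table partitioned on a low-cardinality date column, with Z-ordering on the higher-cardinality filter columns instead of extra partitions. Table and column names are assumptions.

```python
# Sketch: date-partitioned Delta table; names and columns are placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.events (
        event_id   STRING,
        user_id    STRING,
        region     STRING,
        event_ts   TIMESTAMP,
        event_date DATE
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")

# For multi-dimensional filters, cluster the data files on the filter columns
# rather than adding more partition columns.
spark.sql("OPTIMIZE silver.events ZORDER BY (region, user_id)")
```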
Schema governance
- Use schema evolution features carefully.
- Enforce schemas through Unity Catalog and table contracts.
- Avoid relying on schema inference in production pipelines.
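For instance, a minimal sketch of enforcing a schema at read time instead of inferring it (column names, types, and the path are illustrative):

```python
# Sketch: declare an explicit schema instead of relying on inference.
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("updated_at", TimestampType(), nullable=True),
])

raw_df = (
    spark.read
         .schema(orders_schema)          # enforce the contract at read time
         .json("/Volumes/raw/orders/")   # path is a placeholder
)
```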
3. Use the right ingestion & orchestration tools
Auto Loader
- Handles large directories efficiently.
- Provides incremental ingestion with exactly-once guarantees.
- Scales metadata operations and supports schema evolution.
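A minimal Auto Loader sketch, assuming JSON landing files and Unity Catalog volume paths (all paths and table names below are placeholders):

```python
# Sketch: Auto Loader ingestion into a Bronze table; paths and names are placeholders.
bronze_stream = (
    spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/Volumes/etl/_schemas/orders")
         .load("/Volumes/raw/orders/")
)

query = (
    bronze_stream.writeStream
                 .option("checkpointLocation", "/Volumes/etl/_checkpoints/bronze_orders")
                 .trigger(availableNow=True)   # batch-style incremental run
                 .toTable("bronze.orders")
)
```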
Lakeflow declarative pipelines (formerly DLT)
- Define tables and dependencies declaratively; the platform handles orchestration.
- Built-in expectations (data quality rules) and optimisations.
- Ideal for Bronze → Silver → Gold patterns with minimal boilerplate.
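A small sketch of what such a pipeline can look like in Python, using the DLT-style decorators; the table names, source path, and quality rule are illustrative, and the exact import may differ by runtime version:

```python
# Sketch: a declarative (DLT-style) pipeline with a data quality expectation.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested as-is")
def bronze_orders():
    return (
        spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/Volumes/raw/orders/")   # path is a placeholder
    )

@dlt.table(comment="Cleaned, conformed orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
           .withColumn("amount", F.col("amount").cast("double"))
    )
```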
Hand-rolled ingestion loops or custom notebook orchestrations usually become complex and expensive over time; the built-in tools typically implement the same logic more efficiently.
4. Spark / SQL-level optimisations
Small code decisions can have huge impact.
Use SQL where possible
Set-based SQL over Delta tables is generally faster than row-by-row processing in Python, because the engine can optimise the whole query.
Window functions, joins, and aggregations outperform custom logic.
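For example, deduplicating to the latest record per key is a single window function rather than a Python loop (table and column names are placeholders):

```python
# Sketch: latest record per key with a window function instead of row-by-row logic.
latest_orders = spark.sql("""
    SELECT * FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
        FROM silver.orders
    )
    WHERE rn = 1
""").drop("rn")
```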
Minimise UDF usage
- Prefer native SQL or built-in functions.
- Use vectorised UDFs only when necessary.
- Keep functions stateless.
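As a small illustration, the kind of cleanup that often ends up in a Python UDF can usually be expressed with built-in functions (table and column names are assumptions):

```python
# Sketch: replace a Python UDF with built-in functions.
from pyspark.sql import functions as F

orders = spark.table("silver.orders")

# Instead of: @udf("string") def normalise_email(s): return s.strip().lower()
clean = orders.withColumn("email", F.lower(F.trim(F.col("email"))))
```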
Join optimisation
- Identify data skew; use salting or skew-handling techniques as needed.
- Apply join strategies based on table sizes (e.g., broadcasting small tables to avoid shuffling the large side).
- Drop unused columns as early as possible to reduce shuffle load.
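A short sketch combining two of these ideas: select only the needed columns early and broadcast the small dimension table (table names, sizes, and join keys are assumptions):

```python
# Sketch: prune columns early and broadcast the small side of the join.
from pyspark.sql import functions as F

orders = spark.table("silver.orders").select("order_id", "customer_id", "amount")
customers = spark.table("silver.customers").select("customer_id", "region")

# Broadcasting the small dimension avoids shuffling the large orders table.
enriched = orders.join(F.broadcast(customers), on="customer_id", how="left")
```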
Caching
Cache only when needed and remember to uncache to avoid memory pressure.
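For example (the table name is a placeholder; the point is the cache/unpersist pairing around genuine reuse):

```python
# Sketch: cache a reused intermediate result, then release it when done.
from pyspark.sql import functions as F

features = (
    spark.table("silver.orders")
         .groupBy("customer_id")
         .agg(F.count("*").alias("order_count"))
)
features.cache()

high_volume = features.filter("order_count > 10").count()   # first action populates the cache
avg_orders = features.agg(F.avg("order_count")).collect()   # second action reuses it

features.unpersist()  # free executor memory once the reuse is over
```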
5. Right-size compute & leverage engine features
Photon
Works especially well for SQL and ETL workloads, improving latency and CPU efficiency.
Cluster tuning
- Autoscaling helps, but define sensible min/max limits.
- Use spot or preemptible instances for non-critical batch jobs.
- Ensure the driver has sufficient memory for query planning, collect() calls, and broadcasts.
Serverless compute
Great for ad-hoc or spiky workloads, eliminating idle compute cost.
6. Make cost & performance observable
You can only optimise what you measure.
Use tagging for clusters and jobs (environment, owner, project).
Leverage system tables and account usage to track:
- DBU consumption
- job execution times
- tables with excessive file counts
- unnecessary daily/hourly jobs
Many organisations discover that 10–20% of jobs can be merged, reduced in frequency, or retired altogether.
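As a starting point, a query like the one below over the billing system table surfaces DBU consumption per SKU for the last month; the `system.billing.usage` column names reflect the schema at the time of writing, so verify them in your workspace:

```python
# Sketch: daily DBU consumption per SKU from the billing system table.
usage = spark.sql("""
    SELECT usage_date,
           sku_name,
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_date, sku_name
    ORDER BY usage_date
""")
usage.show()
```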
7. Engineering discipline: CI/CD, testing, modularity
Pipeline optimisation is not just technical; it's organisational.
Modular code
Split ingestion logic, transformation logic, and utilities into separate modules or notebooks.
Testing and data quality
- Unit tests for logic.
- Data quality rules in declarative pipelines or enforced in SQL.
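A tiny pytest-style example of what a unit test for transformation logic can look like; the transformation itself and the fixture data are illustrative assumptions:

```python
# Sketch: a small unit test for a transformation function (all names are placeholders).
from pyspark.sql import SparkSession, functions as F

def clean_orders(df):
    """Example transformation under test: normalise email and drop null keys."""
    return (
        df.withColumn("email", F.lower(F.trim(F.col("email"))))
          .filter(F.col("order_id").isNotNull())
    )

def test_clean_orders():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    input_df = spark.createDataFrame(
        [("o1", "  Alice@Example.COM "), (None, "bob@example.com")],
        ["order_id", "email"],
    )
    result = clean_orders(input_df).collect()
    assert len(result) == 1
    assert result[0]["email"] == "alice@example.com"
```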
CI/CD with Repos or Asset Bundles
Promote code between dev → test → prod environments.
Avoid editing directly in production.
This improves reliability and reduces the cost of trial-and-error development.
8. Checklist I use for diagnosing slow or expensive pipelines
- Incremental vs full refresh?
- Are source/target in Delta with correct file sizes?
- Over-partitioned or under-partitioned?
- Excessive shuffles?
- Any heavy UDFs that could be replaced?
- Proper compute size + using Photon?
- Run frequency appropriate?
- Regular compaction and vacuuming?
Addressing just a few of these usually gives immediate improvements.
This list does not cover everything, but for practitioners new to Databricks, it's a solid starting point.
All the best,
Bianca