When I think about optimising ETL on the Databricks Lakehouse, I split it into four layers: data layout, Spark/SQL design, platform configuration, and operational excellence.
And above all: you are not building pipelines for yourself; you are building them for the business to consume. Keeping this in mind changes how you design for reliability, consistency, and performance.
1. Start with the right architecture: medallion + incremental
Avoid building a single, complex pipeline that tries to do everything.
Medallion layers
- Bronze: raw, ingested as-is
- Silver: cleaned and conformed
- Gold: business-ready marts and aggregates
This keeps each step simpler, easier to test, and cheaper to rerun.
Always design for incremental processing
- Use timestamp columns (e.g., ingestion_date, updated_at).
- With Lakeflow declarative pipelines (formerly DLT), prefer the built-in CDC capabilities instead of full reloads.
- For batch pipelines, filter incrementally and use merge patterns instead of overwriting large datasets.
This reduces data volume processed per run, shortens execution times, and reduces compute cost.
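As a minimal sketch of the incremental-plus-merge pattern, the snippet below filters new source rows by a watermark and upserts them into the target with a Delta MERGE. The table names, the `updated_at` watermark column, and the `order_id` key are illustrative assumptions, and `spark` is the session available in a Databricks notebook.

```python
# Sketch: incremental batch upsert with a watermark filter and MERGE.
# Table names, the updated_at column, and the order_id key are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Only read source rows that arrived since the last successful run.
last_watermark = "2024-01-01T00:00:00"  # in practice, load this from a control table
incremental_df = (
    spark.table("bronze.orders")
         .filter(F.col("updated_at") > F.lit(last_watermark))
)

# Upsert into the Silver table instead of overwriting it.
target = DeltaTable.forName(spark, "silver.orders")
(
    target.alias("t")
          .merge(incremental_df.alias("s"), "t.order_id = s.order_id")
          .whenMatchedUpdateAll()
          .whenNotMatchedInsertAll()
          .execute()
)
```

In practice the watermark would be read from, and written back to, a small control table (or derived from the target) so each run only processes what the previous run has not.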
2. Optimise storage & Delta layout (often the biggest impact area)
Most slow pipelines are ultimately storage-layout issues.
Use Delta Lake by default
It gives you ACID transactions, schema enforcement, time travel, and optimisations such as OPTIMIZE, data skipping, and vacuuming.
Right-size your files
Small files slow everything down.
Use table optimisation to compact files into appropriate sizes (often 128–512 MB depending on usage).
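As an example of what compaction can look like in practice (the table name and target size are placeholders, and `delta.targetFileSize` is a Databricks-specific table property worth verifying for your runtime):

```python
# Sketch: compacting small files on a Delta table; the table name is a placeholder.
spark.sql("OPTIMIZE silver.orders")

# Optionally steer future file sizes toward a target that matches the workload
# (assumption: ~256 MB suits this table's query patterns).
spark.sql("""
    ALTER TABLE silver.orders
    SET TBLPROPERTIES ('delta.targetFileSize' = '256mb')
""")
```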
Partitioning strategy matters
- Partition on low-cardinality, commonly-filtered columns (e.g., date, region).
- Avoid partitioning on high-cardinality attributes (e.g., user_id).
- For more complex filtering, use multi-dimensional clustering techniques instead of excessive partitioning.
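To make this concrete, here is a hedged sketch: a table partitioned on a low-cardinality date column, with Z-ordering on the higher-cardinality filter columns instead of extra partitions. Table and column names are assumptions.

```python
# Sketch: date-partitioned Delta table; names and columns are placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.events (
        event_id   STRING,
        user_id    STRING,
        region     STRING,
        event_ts   TIMESTAMP,
        event_date DATE
    )
    USING DELTA
    PARTITIONED BY (event_date)
""")

# For multi-dimensional filters, cluster the data files on the filter columns
# rather than adding more partition columns.
spark.sql("OPTIMIZE silver.events ZORDER BY (region, user_id)")
```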
Schema governance
- Use schema evolution features carefully.
- Enforce schemas through Unity Catalog and table contracts.
- Avoid relying on schema inference in production pipelines.
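For instance, a minimal sketch of enforcing a schema at read time instead of inferring it (column names, types, and the path are illustrative):

```python
# Sketch: declare an explicit schema instead of relying on inference.
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
    StructField("updated_at", TimestampType(), nullable=True),
])

raw_df = (
    spark.read
         .schema(orders_schema)          # enforce the contract at read time
         .json("/Volumes/raw/orders/")   # path is a placeholder
)
```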
3. Use the right ingestion & orchestration tools
Auto Loader
- Handles large directories efficiently.
- Provides incremental ingestion with exactly-once guarantees.
- Scales metadata operations and supports schema evolution.
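A minimal Auto Loader sketch, assuming JSON landing files and Unity Catalog volume paths (all paths and table names below are placeholders):

```python
# Sketch: Auto Loader ingestion into a Bronze table; paths and names are placeholders.
bronze_stream = (
    spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/Volumes/etl/_schemas/orders")
         .load("/Volumes/raw/orders/")
)

query = (
    bronze_stream.writeStream
                 .option("checkpointLocation", "/Volumes/etl/_checkpoints/bronze_orders")
                 .trigger(availableNow=True)   # batch-style incremental run
                 .toTable("bronze.orders")
)
```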
Lakeflow declarative pipelines (formerly DLT)
- Define tables and dependencies declaratively; the platform handles orchestration.
- Built-in expectations (data quality rules) and optimisations.
- Ideal for Bronze → Silver → Gold patterns with minimal boilerplate.
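A small sketch of what such a pipeline can look like in Python, using the DLT-style decorators; the table names, source path, and quality rule are illustrative, and the exact import may differ by runtime version:

```python
# Sketch: a declarative (DLT-style) pipeline with a data quality expectation.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested as-is")
def bronze_orders():
    return (
        spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("/Volumes/raw/orders/")   # path is a placeholder
    )

@dlt.table(comment="Cleaned, conformed orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
           .withColumn("amount", F.col("amount").cast("double"))
    )
```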
Hand-rolled ingestion loops or custom notebook orchestrations usually become complex and expensive over time; the built-in tools typically implement the same logic more efficiently.
4. Spark / SQL-level optimisations
Small code decisions can have huge impact.
Use SQL where possible
Set-based SQL over Delta tables is generally faster than row-by-row processing in Python, because the engine can optimise the whole query.
Window functions, joins, and aggregations outperform custom logic.
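For example, deduplicating to the latest record per key is a single window function rather than a Python loop (table and column names are placeholders):

```python
# Sketch: latest record per key with a window function instead of row-by-row logic.
latest_orders = spark.sql("""
    SELECT * FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
        FROM silver.orders
    )
    WHERE rn = 1
""").drop("rn")
```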
Minimise UDF usage
- Prefer native SQL or built-in functions.
- Use vectorised UDFs only when necessary.
- Keep functions stateless.
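As a small illustration, the kind of cleanup that often ends up in a Python UDF can usually be expressed with built-in functions (table and column names are assumptions):

```python
# Sketch: replace a Python UDF with built-in functions.
from pyspark.sql import functions as F

orders = spark.table("silver.orders")

# Instead of: @udf("string") def normalise_email(s): return s.strip().lower()
clean = orders.withColumn("email", F.lower(F.trim(F.col("email"))))
```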
Join optimisation
- Identify data skew; use salting or skew-handling techniques as needed.
- Apply join strategies based on table sizes (e.g., broadcasting small tables to avoid shuffling the large side).
- Drop unused columns as early as possible to reduce shuffle load.
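A short sketch combining two of these ideas: select only the needed columns early and broadcast the small dimension table (table names, sizes, and join keys are assumptions):

```python
# Sketch: prune columns early and broadcast the small side of the join.
from pyspark.sql import functions as F

orders = spark.table("silver.orders").select("order_id", "customer_id", "amount")
customers = spark.table("silver.customers").select("customer_id", "region")

# Broadcasting the small dimension avoids shuffling the large orders table.
enriched = orders.join(F.broadcast(customers), on="customer_id", how="left")
```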
Caching
Cache only when needed and remember to uncache to avoid memory pressure.
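For example (the table name is a placeholder; the point is the cache/unpersist pairing around genuine reuse):

```python
# Sketch: cache a reused intermediate result, then release it when done.
from pyspark.sql import functions as F

features = (
    spark.table("silver.orders")
         .groupBy("customer_id")
         .agg(F.count("*").alias("order_count"))
)
features.cache()

high_volume = features.filter("order_count > 10").count()   # first action populates the cache
avg_orders = features.agg(F.avg("order_count")).collect()   # second action reuses it

features.unpersist()  # free executor memory once the reuse is over
```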
5. Right-size compute & leverage engine features
Photon
Works especially well for SQL and ETL workloads, improving latency and CPU efficiency.
Cluster tuning
- Autoscaling helps, but define sensible min/max limits.
- Use spot or preemptible instances for non-critical batch jobs.
- Ensure the driver has sufficient memory for query planning, collect() calls, and broadcasts.
Serverless compute
Great for ad-hoc or spiky workloads, eliminating idle compute cost.
6. Make cost & performance observable
You can only optimise what you measure.
Use tagging for clusters and jobs (environment, owner, project).
Leverage system tables and account usage to track:
- DBU consumption
- job execution times
- tables with excessive file counts
- unnecessary daily/hourly jobs
Many organisations discover that 10–20% of jobs can be merged, reduced in frequency, or retired altogether.
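As a starting point, a query like the one below over the billing system table surfaces DBU consumption per SKU for the last month; the `system.billing.usage` column names reflect the schema at the time of writing, so verify them in your workspace:

```python
# Sketch: daily DBU consumption per SKU from the billing system table.
usage = spark.sql("""
    SELECT usage_date,
           sku_name,
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_date, sku_name
    ORDER BY usage_date
""")
usage.show()
```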
7. Engineering discipline: CI/CD, testing, modularity
Pipeline optimisation is not just technical; it's organisational.
Modular code
Split ingestion logic, transformation logic, and utilities into separate modules or notebooks.
Testing and data quality
- Unit tests for logic.
- Data quality rules in declarative pipelines or enforced in SQL.
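A tiny pytest-style example of what a unit test for transformation logic can look like; the transformation itself and the fixture data are illustrative assumptions:

```python
# Sketch: a small unit test for a transformation function (all names are placeholders).
from pyspark.sql import SparkSession, functions as F

def clean_orders(df):
    """Example transformation under test: normalise email and drop null keys."""
    return (
        df.withColumn("email", F.lower(F.trim(F.col("email"))))
          .filter(F.col("order_id").isNotNull())
    )

def test_clean_orders():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    input_df = spark.createDataFrame(
        [("o1", "  Alice@Example.COM "), (None, "bob@example.com")],
        ["order_id", "email"],
    )
    result = clean_orders(input_df).collect()
    assert len(result) == 1
    assert result[0]["email"] == "alice@example.com"
```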
CI/CD with Repos or Asset Bundles
Promote code between dev → test → prod environments.
Avoid editing directly in production.
This improves reliability and reduces the cost of trial-and-error development.
8. Checklist I use for diagnosing slow or expensive pipelines
- Incremental vs full refresh?
- Are source/target in Delta with correct file sizes?
- Over-partitioned or under-partitioned?
- Excessive shuffles?
- Any heavy UDFs that could be replaced?
- Proper compute size + using Photon?
- Run frequency appropriate?
- Regular compaction and vacuuming?
Addressing just a few of these usually gives immediate improvements.
This list does not cover everything, but for practitioners new to Databricks, it's a solid starting point.
All the best,
Bianca