3 weeks ago
If you are building data pipelines in Databricks (where data is Extracted, Transformed, and Loaded), what tips, methods, or best practices do you use to make those pipelines run faster, cheaper, and more efficiently?
3 weeks ago - last edited 3 weeks ago
When I think about optimising ETL on the Databricks Lakehouse, I split it into four layers: data layout, Spark/SQL design, platform configuration, and operational excellence.
And above all: you are not building pipelines for yourself, you are building them for the business to consume. Keeping this in mind changes how you design for reliability, consistency, and performance.
1. Pipeline design: modular stages, incremental processing
Avoid building a single, complex pipeline that tries to do everything.
Medallion layers (bronze → silver → gold)
This keeps each step simpler, easier to test, and cheaper to rerun.
Always design for incremental processing
This reduces data volume processed per run, shortens execution times, and reduces compute cost.
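As a minimal sketch of the incremental pattern, assuming a high-watermark timestamp column (the table and column names below are hypothetical, and `spark` is the notebook's active SparkSession):

```python
from pyspark.sql import functions as F

# Read the last successfully processed timestamp from a small control table
# (first-run/bootstrap handling omitted for brevity)
last_ts = (
    spark.table("etl_watermarks")
    .filter(F.col("table_name") == "bronze.orders")
    .agg(F.max("watermark_ts"))
    .collect()[0][0]
)

# Only pick up rows that arrived since the previous run
incremental_df = (
    spark.table("bronze.orders")
    .filter(F.col("updated_at") > F.lit(last_ts))
)

# Transform and append/merge into the next layer, then advance the watermark
incremental_df.write.mode("append").saveAsTable("silver.orders")
```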
2. Optimise storage & Delta layout (often the biggest impact area)
Most slow pipelines are ultimately storage-layout issues.
Use Delta Lake by default
It gives you ACID transactions, schema enforcement, time travel, and optimisations such as OPTIMIZE, data skipping, and vacuuming.
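A minimal sketch of a few of these features, assuming an existing Delta table (the name silver.orders is illustrative; `spark` is the notebook's SparkSession):

```python
# Inspect the transaction log: every committed write is an ACID version
spark.sql("DESCRIBE HISTORY silver.orders").show(truncate=False)

# Time travel: read the table as of an earlier version
previous = spark.read.option("versionAsOf", 5).table("silver.orders")

# Remove data files no longer referenced by the transaction log
# (default retention is 7 days; shorten only with care)
spark.sql("VACUUM silver.orders")
```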
Right-size your files
Small files slow everything down.
Use table optimisation to compact files into appropriate sizes (often 128–512 MB depending on usage).
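A hedged sketch of compaction, again with an illustrative table name; delta.targetFileSize is a Databricks Delta table property, and 256 MB is just one reasonable choice within the range above:

```python
# Set a target file size for the table, then compact existing small files
spark.sql("""
    ALTER TABLE silver.orders
    SET TBLPROPERTIES ('delta.targetFileSize' = '256mb')
""")
spark.sql("OPTIMIZE silver.orders")
```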
Partitioning strategy matters
Partition only on low-cardinality columns that queries actually filter on (a date column is the classic case); over-partitioning recreates the small-file problem above.
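A hypothetical example (toy DataFrame, illustrative table name) of partitioning on a single low-cardinality date column:

```python
from pyspark.sql import functions as F

# Toy DataFrame standing in for real event data
events = spark.range(1_000).withColumn("event_date", F.current_date())

(
    events.write.format("delta")
    .partitionBy("event_date")   # low-cardinality column that queries filter on
    .mode("overwrite")
    .saveAsTable("silver.events")
)
```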
Schema governance
Enforce schemas on write and evolve them deliberately, so upstream changes do not silently break downstream tables.
3. Use the right ingestion & orchestration tools
Auto Loader
Lakeflow declarative pipelines (formerly DLT)
Hand-rolled ingestion loops and custom notebook orchestrations usually become complex and expensive over time; the built-in tools provide the same logic more efficiently (see the Auto Loader sketch below).
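For example, a hedged Auto Loader sketch that incrementally picks up new JSON files into a bronze table (paths, file format, and table names are placeholders):

```python
# Incrementally ingest new files from cloud storage into a bronze Delta table
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/landing/_schemas/orders")
    .load("/Volumes/main/raw/landing/orders/")
)

(
    stream.writeStream
    .option("checkpointLocation", "/Volumes/main/raw/landing/_checkpoints/orders")
    .trigger(availableNow=True)   # process all new files, then stop (batch-style)
    .toTable("bronze.orders")
)
```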
4. Spark / SQL-level optimisations
Small code decisions can have huge impact.
Use SQL where possible
SQL (or equivalent DataFrame operations) on Delta, executed by Spark's engine, is generally much faster than row-by-row processing in Python.
Window functions, joins, and aggregations outperform custom logic.
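An illustrative example of this point: taking the latest record per key with a window function instead of looping over groups in Python (table and column names are hypothetical):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each customer by recency, then keep only the newest one
w = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())

latest_per_customer = (
    spark.table("silver.orders")
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)
```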
Minimise UDF usage
Join optimisation
Caching
Cache only when a result is actually reused, and remember to uncache it afterwards to avoid memory pressure (see the combined sketch below).
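A combined, hedged sketch of the three points above (UDFs, joins, caching); all table and column names are hypothetical:

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

orders = spark.table("silver.orders")
countries = spark.table("reference.countries")   # small dimension table

# Minimise UDFs: prefer built-in functions over a Python UDF
orders = orders.withColumn("email_domain", F.split("email", "@").getItem(1))

# Join optimisation: broadcast the small side to avoid a full shuffle
enriched = orders.join(broadcast(countries), "country_code")

# Caching: cache only when a result is reused, and release it afterwards
enriched.cache()
summary_by_day = enriched.groupBy("order_date").count()
summary_by_country = enriched.groupBy("country_code").count()
enriched.unpersist()
```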
5. Right-size compute & leverage engine features
Photon
Works especially well for SQL and ETL workloads, improving latency and CPU efficiency.
Cluster tuning
Serverless compute
Great for ad-hoc or spiky workloads, eliminating idle compute cost.
6. Make cost & performance observable
You can only optimise what you measure.
Use tagging for clusters and jobs (environment, owner, project).
Leverage system tables and account usage data to track cost, runtime, and run frequency per job, cluster, and tag.
Many organisations discover that 10–20% of jobs can be merged, reduced in frequency, or retired altogether.
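A hedged example, assuming system tables are enabled in your workspace and that you tag jobs with a 'project' key (adjust the tag key and time window to your own scheme):

```python
# Summarise DBU usage per project tag over the last 30 days
usage_by_project = spark.sql("""
    SELECT
        usage_date,
        custom_tags['project'] AS project,
        SUM(usage_quantity)    AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_date, custom_tags['project']
    ORDER BY usage_date
""")
display(usage_by_project)
```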
7. Engineering discipline: CI/CD, testing, modularity
Pipeline optimisation is not just technical; it's organisational.
Modular code
Split ingestion logic, transformation logic, and utilities into separate modules or notebooks.
Testing and data quality
CI/CD with Repos or Asset Bundles
Promote code between dev → test → prod environments.
Avoid editing directly in production.
This improves reliability and reduces the cost of trial-and-error development.
8. Checklist I use for diagnosing slow or expensive pipelines
Addressing just a few of these usually gives immediate improvements.
This list does not cover everything, but for practitioners new to Databricks, it's a solid starting point.
All the best,
Bianca
3 weeks ago - last edited 3 weeks ago
@bianca_unifeye, you have covered most of them.
I would like to add a few:
With respect to choosing a data format, I would recommend the Delta format.
Below are some areas of improvement when using Delta:
1. Enable deletion vectors for improved performance of deletes, updates, and merges.
2. Enable liquid clustering (improves query performance and removes the need to manage partitions and file sizes by hand).
3. Tune the Delta checkpoint interval (a checkpoint file is created at the configured interval, which speeds up reads of the transaction log).
4. Enable Delta auto compaction and optimized writes.
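As a hedged sketch, the settings above map to Delta table properties and DDL (names follow the Databricks/Delta documentation; verify them against your runtime version, and note that liquid clustering replaces explicit partitioning). The table names are illustrative:

```python
# Deletion vectors, checkpoint interval, optimized writes, and auto compaction
spark.sql("""
    ALTER TABLE silver.orders SET TBLPROPERTIES (
        'delta.enableDeletionVectors'      = 'true',
        'delta.checkpointInterval'         = '100',
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

# Liquid clustering is declared at table creation (or via ALTER TABLE ... CLUSTER BY)
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.events (
        event_id   BIGINT,
        event_date DATE
    ) CLUSTER BY (event_date)
""")
```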
With respect to Spark:
1. Enable AQE (https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution). Depending on your requirements, also tune the individual config parameters under AQE; it handles most common performance issues such as skewed joins and too many small shuffle partitions.
2. Use DataFrames or SQL, as they go through the Catalyst optimizer and, combined with Photon, give the best performance (but Photon comes at a cost, so decide which matters more to you: time or money).
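A small, hedged sketch of the AQE settings (these are on by default on recent Spark / Databricks Runtime versions; shown explicitly only for illustration):

```python
# Adaptive Query Execution: re-plans queries at runtime using actual statistics
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
```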