Migrating large on-premises ETL workflows to Databricks often goes wrong when teams try to "lift and shift" legacy logic directly into Spark. Poor data layout, many tiny files, and inefficient partitioning quickly degrade performance, so restructuring the data and adopting Delta Lake early is crucial. Many teams also underestimate the need to redesign pipelines for distributed processing rather than porting step-by-step ETL as-is. Cluster sizing, cost control, and missing orchestration features (dependencies, retries, alerts) are other common pain points, and security mapping and schema evolution issues can cause outright failures. The keys are to optimize data structures, modernize transformations, establish proper workflow orchestration, and test with realistic data volumes.
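To make the orchestration point concrete, here is a minimal pure-Python sketch of the three features that legacy cron-style ETL usually lacks: dependency-aware execution, retries with backoff, and alerting on final failure. This is an illustration of the pattern, not the Databricks Workflows API; all function and parameter names (`run_with_retries`, `run_pipeline`, `max_retries`, `base_delay`, `alert`) are hypothetical, and in practice a platform scheduler such as Databricks Workflows or Airflow handles this for you.

```python
import time


def run_with_retries(task, name, max_retries=2, base_delay=1.0, alert=print):
    """Run a task callable, retrying with exponential backoff; alert on final failure."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_retries:
                # Last attempt exhausted: raise after notifying, so upstream can react.
                alert(f"ALERT: task '{name}' failed after {attempt + 1} attempts: {exc}")
                raise
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying


def run_pipeline(tasks, deps, **retry_kwargs):
    """Run tasks in dict order (assumed topological); skip tasks whose upstream failed.

    tasks: mapping of task name -> zero-arg callable
    deps:  mapping of task name -> list of upstream task names
    Returns (succeeded, failed) sets of task names.
    """
    succeeded, failed = set(), set()
    for name, task in tasks.items():
        if any(upstream in failed for upstream in deps.get(name, [])):
            failed.add(name)  # propagate failure downstream instead of running blindly
            continue
        try:
            run_with_retries(task, name, **retry_kwargs)
            succeeded.add(name)
        except Exception:
            failed.add(name)
    return succeeded, failed
```

The useful property to replicate in any real scheduler is the skip-on-upstream-failure behavior: a failed transform should stop the load step rather than letting it run against stale or partial data.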