Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What are common pitfalls when migrating large on-premise ETL workflows to Databricks, and how did you solve them?

Suheb
New Contributor III

When moving your big data pipelines from local servers to Databricks, what problems usually happen, and how did you fix them?

3 REPLIES

Raman_Unifeye
Contributor III

That's a very broad question, and the answer depends on several factors.

There have been a few community discussions on this in the past; see if any of them are useful to you:

https://community.databricks.com/t5/technical-blog/6-migration-mistakes-you-don-t-want-to-make-part-...

https://community.databricks.com/t5/technical-blog/6-migration-mistakes-you-don-t-want-to-make-part-...

 


RG #Driving Business Outcomes with Data Intelligence

jameswood32
New Contributor III

Common pitfalls when migrating large on-prem ETL workflows to Databricks include:

  1. Assuming a 1:1 migration – On-prem jobs often need re-architecture for Spark's distributed model.

  2. Ignoring data skew and partitioning – Large datasets can cause performance bottlenecks if not properly partitioned.

  3. Underestimating dependencies – Legacy scripts, stored procedures, and external systems often break without proper mapping.

  4. Inefficient cost management – Autoscaling and cluster sizing need careful tuning to avoid overspending.

  5. Testing gaps – End-to-end validation is crucial; small logic changes can have big impacts at scale.

In our migration, we tackled these by re-architecting pipelines for Spark, optimizing partitions, mapping dependencies early, and building a robust testing framework before going live.
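One common fix for point 2 (data skew) is key salting before a join. Here is a minimal sketch of the idea in plain Python; the key name and bucket count are illustrative, and in a real pipeline the salt would be added as a Spark column expression before joining:

```python
import random

def salt_key(key: str, num_salts: int) -> str:
    """Append a random salt suffix so one hot key spreads across
    num_salts buckets instead of landing on a single executor."""
    return f"{key}_{random.randrange(num_salts)}"

# Spread a hypothetical hot key across 8 salted buckets.
random.seed(0)
salted = {salt_key("customer_42", 8) for _ in range(1000)}
# All 8 buckets get hit, so the join work fans out instead of piling
# onto the one partition that holds the hot key.
```

The dimension side of the join is then exploded with all possible salt suffixes so the salted keys still match.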

James Wood

ShaneCorn
New Contributor III

Common pitfalls when migrating large on-premise ETL workflows to Databricks include data compatibility issues, lack of scalability planning, and inefficient resource management. Data transformation logic may need to be rewritten for Spark compatibility. Additionally, performance tuning in a cloud environment can be challenging without proper cost management strategies. To avoid these issues, ensure thorough testing, optimize Spark configurations, and use Delta Lake for efficient data management. Implementing automated scaling and monitoring also helps maintain performance and minimize costs.
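Several of the tuning points mentioned above boil down to a handful of Spark settings revisited during migration. A hedged sketch, using a plain dict to stand in for a cluster's Spark config (values are examples only, and the `spark.databricks.*` key is Databricks-specific):

```python
# Illustrative settings often revisited when moving ETL onto Databricks;
# these are starting points, not recommendations for every workload.
spark_conf = {
    # Adaptive query execution re-plans joins and shuffles at runtime.
    "spark.sql.adaptive.enabled": "true",
    # On Databricks, "auto" lets AQE choose shuffle partition counts
    # instead of the fixed default of 200.
    "spark.sql.shuffle.partitions": "auto",
    # Delta optimized writes reduce small-file problems on ingest.
    "spark.databricks.delta.optimizeWrite.enabled": "true",
}
```

In practice these would be set on the cluster or via `spark.conf.set(...)` rather than kept in a dict.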