Hi All,
I’m currently working on a Proof of Concept (POC) to migrate existing Talend ETL jobs to Databricks. The goal is to leverage Databricks for data processing and orchestration while moving away from Talend.
I’d appreciate insights on the following:
Migration Approach:
- Is there a recommended strategy for converting Talend jobs (which use components like tMap, tFileInputDelimited, etc.) into Databricks workflows?
- Should we rewrite the logic in PySpark/SQL notebooks, or are there automation tools or accelerators available? (I've sketched the manual-rewrite approach below.)
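For context, this is roughly how I'm picturing a simple tMap (lookup join + column mapping) being rewritten in PySpark. The paths, tables, and columns are placeholders, not our real schema:

```python
from pyspark.sql import functions as F

# Main flow: what tFileInputDelimited would read today
# (spark is the SparkSession provided by the Databricks notebook)
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/mnt/raw/orders/*.csv"))          # placeholder path

# Lookup flow: the secondary input of the tMap
customers = spark.read.table("ref.customers")     # placeholder table

# tMap equivalent: join on the lookup key, then map/derive the output columns
enriched = (orders
            .join(customers, on="customer_id", how="left")
            .withColumn("order_total", F.col("quantity") * F.col("unit_price"))
            .select("order_id", "customer_id", "customer_name",
                    "order_total", "order_date"))

enriched.write.mode("overwrite").saveAsTable("silver.orders_enriched")
```

If there's a tool that generates something like this from the Talend job definitions instead of us hand-porting every tMap, that's exactly what I'm hoping to hear about.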
Data Orchestration:
- How do you typically handle job scheduling and dependencies in Databricks compared to Talend’s job orchestration?
- Any tips for integrating with Airflow or Databricks Workflows?
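On the Airflow side, this is the kind of DAG I'm assuming we'd end up with, where each Databricks Workflows job is triggered by ID. It assumes Airflow 2.x with the Databricks provider installed; the connection name and job IDs are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="talend_migration_poc",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",      # nightly, same cadence we run the Talend jobs on
    catchup=False,
) as dag:
    ingest = DatabricksRunNowOperator(
        task_id="ingest_orders",
        databricks_conn_id="databricks_default",
        job_id=111,            # placeholder Databricks job ID
    )
    transform = DatabricksRunNowOperator(
        task_id="transform_orders",
        databricks_conn_id="databricks_default",
        job_id=222,            # placeholder Databricks job ID
    )

    # Dependency chain replacing Talend's OnSubjobOk links
    ingest >> transform
```

Part of the question is whether Airflow is worth keeping at all, or whether task dependencies inside Databricks Workflows are enough on their own.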
Performance & Optimization:
- What are the best practices for optimizing ETL logic when moving from Talend’s row-based processing to Spark’s distributed architecture?
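One pattern I'm assuming matters here: Talend's row-by-row tMap expressions translate very naturally into per-row Python UDFs, but staying with built-in Spark functions (and broadcasting small lookup tables) keeps the work in Catalyst and avoids serialization overhead. A before/after sketch with made-up columns:

```python
from pyspark.sql import functions as F

df = spark.read.table("silver.orders_enriched")   # placeholder table

# Row-at-a-time habit carried over from Talend: a Python UDF per row (slower)
@F.udf("string")
def classify(total):
    return "HIGH" if total and total > 1000 else "LOW"

slow = df.withColumn("order_class", classify("order_total"))

# Spark-native equivalent: built-in expressions, optimized by Catalyst
fast = df.withColumn(
    "order_class",
    F.when(F.col("order_total") > 1000, "HIGH").otherwise("LOW"),
)

# Small reference data: hint a broadcast join instead of a shuffle join
regions = spark.read.table("ref.regions")          # placeholder small table
joined = fast.join(F.broadcast(regions), on="region_id", how="left")
```

Is this the right mental model, or are there other Talend-specific habits (per-row error rejection, tFlowToIterate loops, etc.) that tend to hurt Spark performance?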
Common Pitfalls:
- What challenges should we anticipate during migration (e.g., error handling, schema evolution, incremental loads)?
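On incremental loads specifically, I'm assuming the Talend insert/update flows map onto Delta MERGE, with new columns handled by Delta's schema auto-merge rather than Talend's dynamic schema. A rough sketch with placeholder tables and a placeholder watermark:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Allow new source columns to be added to the target during MERGE
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Incremental batch, e.g. filtered by a watermark column from the last run
updates = (spark.read.table("silver.orders_enriched")
           .where(F.col("order_date") >= "2024-01-01"))    # placeholder watermark

target = DeltaTable.forName(spark, "gold.orders")           # placeholder target

# Upsert replacing the Talend insert-or-update output behavior
(target.alias("t")
 .merge(updates.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```

If this is the wrong pattern for high-volume incremental loads, or if error handling/reject flows deserve a different design than they had in Talend, I'd love to hear how others approached it.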
If anyone has gone through a similar migration or has resources, templates, or accelerators to share, that would be extremely helpful.
Thanks in advance!