12-07-2024 10:55 PM
Iām working on a project where I need to pull large datasets from Cosmos DB into Databricks for further processing, and Iām trying to decide whether to use Azure Data Factory (ADF) or Databricks PySpark notebooks for the extraction and processing tasks.
Based on these factors, I feel that a PySpark-based solution is more efficient, but Iād like to hear from others with experience.
Iād greatly appreciate any insights or experiences you can share!
Thanks in advance!
12-08-2024 07:37 AM
Hi @DebIT2011 ,
In my experience, consolidating both code and orchestration entirely within Databricks provides substantial benefits. By leveraging Databricks Notebooks for coding and Databricks Workflows for orchestration (potentially managed as code through YAML files), you maintain a single, unified environment. This setup simplifies everything from development to CI/CD pipelines, making ongoing maintenance far more manageable.
While ADF offers a low-code approach, it becomes cumbersome once you introduce more complex logic. Splitting logic between ADF and Databricks quickly leads to maintenance challenges.
Although ADF can be a decent starting point for those new to the ecosystem, in my opinion, it doesn't scale as effectively as a fully Databricks-centric approach.
Given these considerations, I would recommend keeping all logic in Databricks. This approach ensures the codebase, orchestration, and operational workflows remain in one place, improving long-term scalability and maintainability.
12-08-2024 11:08 AM
Hi @DebIT2011,
Hope you're doing well. For incremental upserts and deletes from Cosmos DB, Databricks PySpark offers simplicity through unified management, especially for complex transformations and dependency handling. ADF may excel at GUI-based orchestration and integration scenarios, but it adds complexity by splitting the process across two tools. Consider Databricks for scalability and PySpark scripts for flexibility, while ADF could reduce setup effort for simpler workflows.
Hope this helps!
Regards,
Bryce June
01-09-2025 04:50 AM
Hi @DebIT2011,
Deciding between Azure Data Factory (ADF) and Databricks PySpark notebooks for data extraction and processing depends on several factors specific to your use case. Let's address the main aspects you raised.
Based on the details provided, PySpark in Databricks seems more aligned with your use case, given the focus on complex operations and the need to simplify management. However, if your team values a low-code interface or wants to minimize initial setup efforts, ADF can still be a viable choice.
Regards,
Blank Richards
2 weeks ago
Hi @DebIT2011,
You're facing a classic architectural decision between orchestration with ADF versus direct transformation using Databricks PySpark notebooks. Both tools are powerful but serve different purposes depending on your project needs. Below is a comprehensive analysis and step-by-step guidance to help you choose the most effective approach for your use case involving incremental upserts and deletes from Cosmos DB.
ADF shines in data orchestration and integration scenarios. Some key benefits include:
GUI-Based Orchestration:
You can design ETL/ELT workflows visually without writing code.
Easier for teams unfamiliar with coding-heavy environments.
Built-In Connectors:
ADF has native connectors for Cosmos DB, Azure Blob Storage, ADLS Gen2, and Databricks, reducing setup effort.
Separation of Concerns:
Ideal when you want to decouple orchestration from transformation logic.
Each stage (extraction, staging, transformation) can be managed independently.
Monitoring and Alerts:
Offers centralized logging, retry logic, alerting, and execution history.
However, in your case, you already mentioned that:
Managing jobs across both ADF and Databricks introduces operational complexity.
Delete operations require advanced transformations that are easier in PySpark.
From your use case, Databricks PySpark seems more appropriate. Here's why:
Unified Workflow:
You can write, manage, monitor, and schedule everything inside Databricks, reducing tool sprawl.
Advanced Transformations:
Complex delete logic, joins, and conditional updates are easier in PySpark than in ADF's native data flows.
Reusability & Modularity:
You can create parameterized scripts, define schemas dynamically, and version control them via Git integration.
Dependency Management:
Job orchestration within Databricks (using Job Workflows or Task dependencies) is more seamless when upsert and delete steps are logically connected.
Performance & Scale:
Databricks (especially on Photon or Delta engines) can outperform ADF's data flows when dealing with very large volumes of data and high-throughput jobs.
| Aspect | ADF | Databricks PySpark |
| --- | --- | --- |
| Cost | Lower for basic ETL; can scale with IR and batch jobs | Can be higher for continuous workloads, but better optimized for Spark-heavy tasks |
| Setup | Simple for pipelines, less coding | Requires coding, but more flexible |
| Scalability | Scales well with Integration Runtime | Scales very well with Spark clusters, suited for big data |
| Performance | Slower for heavy transformations | Optimized for transformations, joins, and complex logic |
| Monitoring | GUI, detailed logs for pipelines | Unified notebook logs, cluster metrics, and job histories |
Since you're doing incremental upserts and complex deletes, keeping both logic and orchestration within Databricks will reduce overhead.
Use Delta Lake tables to efficiently handle merges and deletes.
Build a parameterized notebook that accepts table_name, schema, source_path, and operation_type (upsert/delete) as parameters.
Implement merge logic for upserts using Delta Lake.
Use conditional filters and DELETE FROM for deletes, encapsulating all transformation logic (see the sketch below).
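To make that concrete, here is a rough sketch of such a notebook. It assumes the target is a Delta table keyed on an id column and that the change set has been landed as Parquet; the widget names, join key, and source format are placeholders to adapt to your own schema.

```python
# Minimal sketch of a parameterized upsert/delete notebook on a Delta target.
# spark and dbutils are provided by the Databricks notebook runtime.
from delta.tables import DeltaTable

# Notebook parameters
dbutils.widgets.text("table_name", "")
dbutils.widgets.text("source_path", "")
dbutils.widgets.text("operation_type", "upsert")  # "upsert" or "delete"

table_name = dbutils.widgets.get("table_name")
source_path = dbutils.widgets.get("source_path")
operation_type = dbutils.widgets.get("operation_type")

changes_df = spark.read.parquet(source_path)        # incoming incremental change set
target = DeltaTable.forName(spark, table_name)      # existing Delta target table

if operation_type == "upsert":
    # Merge on the business key: update rows that already exist, insert the rest
    (target.alias("t")
        .merge(changes_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
elif operation_type == "delete":
    # Remove target rows whose keys appear in the incoming delete set
    (target.alias("t")
        .merge(changes_df.alias("s"), "t.id = s.id")
        .whenMatchedDelete()
        .execute())
else:
    raise ValueError(f"Unknown operation_type: {operation_type}")
```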
Set up Job Workflows in Databricks to control task execution order.
For example:
Task 1: Ingest new data
Task 2: Upsert
Task 3: Conditional Delete
Define dependencies between tasks for execution control; a rough sketch of the same workflow defined as code follows below.
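If you prefer to manage that workflow as code rather than through the Jobs UI, something along these lines with the Databricks SDK for Python (databricks-sdk) shows the task wiring; the notebook paths and cluster ID are made up for illustration.

```python
# Hypothetical sketch: creating the three-task workflow with the Databricks SDK for Python.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up authentication from the environment / .databrickscfg

job = w.jobs.create(
    name="cosmos-incremental-load",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/ingest_cosmos"),
            existing_cluster_id="<cluster-id>",
        ),
        jobs.Task(
            task_key="upsert",
            depends_on=[jobs.TaskDependency(task_key="ingest")],
            notebook_task=jobs.NotebookTask(
                notebook_path="/Repos/etl/apply_changes",
                base_parameters={"operation_type": "upsert"},
            ),
            existing_cluster_id="<cluster-id>",
        ),
        jobs.Task(
            task_key="conditional_delete",
            depends_on=[jobs.TaskDependency(task_key="upsert")],
            notebook_task=jobs.NotebookTask(
                notebook_path="/Repos/etl/apply_changes",
                base_parameters={"operation_type": "delete"},
            ),
            existing_cluster_id="<cluster-id>",
        ),
    ],
)
print(f"Created job {job.job_id}")
```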
Use last_updated_timestamp or _ts field from Cosmos DB to identify delta changes.
Optionally use change feed in Cosmos DB if high-frequency updates are expected.
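As a rough sketch of the _ts-based approach with the Azure Cosmos DB Spark 3 OLTP connector (azure-cosmos-spark), where the account endpoint, secret scope, database/container names, and watermark value are placeholders:

```python
# Sketch of an incremental read from Cosmos DB using the _ts system property as a watermark.
from pyspark.sql import functions as F

last_watermark = 1733600000  # epoch seconds, e.g. loaded from a control table

cosmos_cfg = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": dbutils.secrets.get("kv-scope", "cosmos-key"),
    "spark.cosmos.database": "<database>",
    "spark.cosmos.container": "<container>",
    # _ts is a system property, so include system properties during schema inference
    "spark.cosmos.read.inferSchema.includeSystemProperties": "true",
}

incremental_df = (
    spark.read.format("cosmos.oltp")
    .options(**cosmos_cfg)
    .load()
    .filter(F.col("_ts") > F.lit(last_watermark))
)

# For high-frequency updates, the connector's change feed reader avoids repeated scans:
# spark.read.format("cosmos.oltp.changeFeed").options(**cosmos_cfg) \
#     .option("spark.cosmos.changeFeed.startFrom", "Beginning").load()
```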
If your team prefers ADF for connecting to Cosmos DB, you can extract and land data as Parquet in ADLS Gen2 via ADF.
From there, Databricks takes over - but remember this adds management overhead in two places.
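Purely for illustration, the Databricks side of that handoff can be as small as the following; the storage account, container, and folder layout are placeholders, and access to ADLS Gen2 is assumed to be configured already (e.g. via a Unity Catalog external location or a service principal).

```python
# Hypothetical landing path written by ADF as Parquet in ADLS Gen2.
landing_path = "abfss://landing@<storageaccount>.dfs.core.windows.net/cosmos/<container>/"

changes_df = spark.read.parquet(landing_path)
# ...then pass changes_df into the same Delta merge/delete logic sketched earlier.
```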
Avoid splitting logic across tools unless necessary; it adds complexity to dependency management and debugging.
Monitor resource costs in Databricks, especially if using large clusters. Auto-terminate and job cluster configurations can help.
Use Git for version control and CI/CD if you're standardizing on notebooks.
Leverage Unity Catalog or Table ACLs if you're managing shared environments across teams.
Based on your scenario, Databricks PySpark notebooks are better suited. You'll benefit from:
Simpler architecture (no tool handoffs),
Greater flexibility for transformation logic,
Easier dependency management,
And better performance on large data volumes.
ADF can be useful in hybrid cases or when non-technical teams must design or manage workflows, but for your case, stick with Databricks.
Regards,
Johns Mak