Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Choosing between Azure Data Factory (ADF) and Databricks PySpark notebooks

DebIT2011
New Contributor II

I’m working on a project where I need to pull large datasets from Cosmos DB into Databricks for further processing, and I’m trying to decide whether to use Azure Data Factory (ADF) or Databricks PySpark notebooks for the extraction and processing tasks.

Use Case:

  • The data requires incremental upserts and deletes, which need to be handled separately.
  • With ADF, I would create a pipeline to extract the data from Cosmos DB, store it as Parquet in ADLS Gen2, and then hand the files off to Databricks, where a Delta Live Tables (DLT) pipeline would be triggered to create a streaming table (data would be merged from the temp table into the target table; see the sketch after this list).
  • However, this approach means managing and monitoring code and jobs in two places (ADF and Databricks), adding complexity.
  • On the other hand, with PySpark in Databricks, I can create reusable scripts for the upsert operation, and specify the schema and table name at the job level. This would keep everything within Databricks, simplifying job management.
  • Since the delete operation is complex and requires additional transformations, I prefer handling it directly in PySpark (Databricks Notebook).
  • In Databricks, managing dependencies between the upsert and delete jobs is straightforward, whereas with ADF → DLT → PySpark delete jobs, managing dependencies becomes more intricate.
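For concreteness, here is a minimal sketch of the DLT merge step described in the second bullet, assuming the ADF pipeline lands Parquet files under an illustrative ADLS Gen2 path and that documents carry an id key plus the Cosmos DB _ts column for ordering (paths, key, and table names are placeholders, not from the actual project):

```python
# Minimal DLT sketch (illustrative paths, key, and table names) that merges
# Parquet files landed by ADF in ADLS Gen2 into a streaming target table.
# Runs inside a Delta Live Tables pipeline, where `spark` is provided.
import dlt
from pyspark.sql.functions import col

LANDING_PATH = "abfss://landing@<storage-account>.dfs.core.windows.net/cosmos/orders/"  # placeholder

@dlt.view(name="orders_staging")
def orders_staging():
    # Incrementally pick up the Parquet files written by the ADF pipeline.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load(LANDING_PATH)
    )

# Streaming target table; APPLY CHANGES performs the upsert by key.
dlt.create_streaming_table("orders")

dlt.apply_changes(
    target="orders",
    source="orders_staging",
    keys=["id"],             # assumed business key
    sequence_by=col("_ts"),  # Cosmos DB timestamp used to order changes
)
```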

Based on these factors, I feel that a PySpark-based solution is more efficient, but I’d like to hear from others with experience.

Questions:

  1. What are the advantages of using ADF for this task over Databricks PySpark notebooks?
  2. Are there specific scenarios where PySpark in Databricks would be more effective for pulling and processing data from Cosmos DB?
  3. How do cost, scalability, performance and setup complexity compare between using ADF and Databricks for this use case?
  4. What best practices or pitfalls should I consider when choosing between ADF and Databricks notebooks for data extraction?

I’d greatly appreciate any insights or experiences you can share!

Thanks in advance!

4 REPLIES

filipniziol
Esteemed Contributor

Hi @DebIT2011 ,

In my experience, consolidating both code and orchestration entirely within Databricks provides substantial benefits. By leveraging Databricks Notebooks for coding and Databricks Workflows for orchestration—potentially managed as code through YAML files—you maintain a single, unified environment. This setup simplifies everything from development to CI/CD pipelines, making ongoing maintenance far more manageable.

While ADF offers a low-code approach, it becomes cumbersome once you introduce more complex logic. Splitting logic between ADF and Databricks quickly leads to maintenance challenges.

Although ADF can be a decent starting point for those new to the ecosystem, in my opinion, it doesn’t scale as effectively as a fully Databricks-centric approach.

Given these considerations, I would recommend keeping all logic in Databricks. This approach ensures the codebase, orchestration, and operational workflows remain in one place, improving long-term scalability and maintainability.

brycejune
New Contributor III

Hi @DebIT2011,

Hope you're doing well. For incremental upserts and deletes from Cosmos DB, Databricks PySpark offers simplicity with unified management, especially for complex transformations and dependency handling. ADF may excel in GUI-based orchestration and integration scenarios but adds complexity by splitting processes. Consider Databricks for scalability and PySpark scripts for flexibility, while ADF could reduce setup effort for simpler workflows.

Hope this helps!

Regards,
Bryce June

BlankRichards
New Contributor III

Hi @DebIT2011,

Deciding between Azure Data Factory (ADF) and Databricks PySpark Notebooks for data extraction and processing depends on several factors specific to your use case. Let’s address each aspect raised:

Advantages of ADF over Databricks PySpark Notebooks

  1. Low-Code Interface: ADF offers a user-friendly graphical interface, making it easier to create and manage pipelines without extensive coding. Ideal for teams with limited programming expertise.
  2. Native Automation and Connectors: ADF has native connectors for Cosmos DB and ADLS Gen2, simplifying integration and scaling.
  3. Centralized Monitoring: ADF provides detailed pipeline monitoring directly in the Azure Portal, facilitating traceability and failure management.
  4. Separation of Concerns: Using ADF for extraction and Databricks for processing ensures a clear division of responsibilities between specialized tools.

Scenarios Where PySpark in Databricks is More Efficient

  1. Complex Transformation Operations: As in your case, where deletes and upserts involve complex logic, PySpark allows flexibility and direct control over data and operations.
  2. Streamlined Workflows: Keeping everything within Databricks reduces the complexity of managing two separate environments (ADF and Databricks).
  3. Script Reusability: PySpark makes it easy to modularize and reuse scripts (a minimal example follows this list), improving development efficiency.
  4. Dependency Management: In Databricks, managing dependencies between tasks is more straightforward, particularly for sequential operations like upserts and deletes.
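To illustrate the reusability point (item 3 above), here is a minimal sketch of a parameterized Delta Lake upsert helper; the table and key names are hypothetical, and the function assumes the target Delta table already exists:

```python
# Reusable upsert helper: merges a source DataFrame into an existing Delta table.
# Table and key-column names are illustrative placeholders.
from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession

def upsert_to_delta(spark: SparkSession, source_df: DataFrame, target_table: str, key_cols: list) -> None:
    condition = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    (
        DeltaTable.forName(spark, target_table)
        .alias("t")
        .merge(source_df.alias("s"), condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

# Example call, with schema and table name supplied at the job level:
# upsert_to_delta(spark, incoming_df, "analytics.orders", ["id"])
```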

Comparison of Cost, Scalability, Performance, and Complexity

  1. Cost:
    • ADF is generally more cost-effective for simple pipelines and low-volume ETL processes.
    • Databricks can be more expensive, especially for heavy workloads, due to computational resource consumption on clusters.
  2. Scalability:
    • ADF is highly scalable for data integration across multiple sources.
    • Databricks is better suited for intensive processing and horizontal scaling of complex transformations.
  3. Performance:
    • ADF may have performance limitations for highly customized operations.
    • PySpark in Databricks delivers better performance for large datasets and advanced logic.
  4. Setup and Complexity:
    • ADF has a simpler and faster initial setup for ETL pipelines.
    • Databricks requires more initial effort for configuration, but consolidating everything in a single environment reduces long-term complexity.

Best Practices and Pitfalls to Avoid

  1. Consider Data Volume: For large datasets, Databricks is better for performance and parallelism.
  2. Manage Dependencies: If choosing ADF, carefully plan pipeline chaining to avoid failures or delays.
  3. Modularize and Document: PySpark scripts should be well modularized for easier maintenance.
  4. Monitor Costs: Use Azure metrics and alerts to avoid unexpected costs in Databricks or ADF.
  5. Test and Iterate: Evaluate both approaches in a test environment to validate performance and cost for your specific use case.

Based on the details provided, PySpark in Databricks seems more aligned with your use case, given the focus on complex operations and the need to simplify management. However, if your team values a low-code interface or wants to minimize initial setup efforts, ADF can still be a viable choice.

Regards,
Blank Richards

Johns404
New Contributor II

Hi @DebIT2011,

You're facing a classic architectural decision between orchestration with ADF versus direct transformation using Databricks PySpark notebooks. Both tools are powerful but serve different purposes depending on your project needs. Below is a comprehensive analysis and step-by-step guidance to help you choose the most effective approach for your use case involving incremental upserts and deletes from Cosmos DB.


✅ Advantages of Azure Data Factory (ADF)

ADF shines in data orchestration and integration scenarios. Some key benefits include:

  1. GUI-Based Orchestration:

    • You can design ETL/ELT workflows visually without writing code.

    • Easier for teams unfamiliar with coding-heavy environments.

  2. Built-In Connectors:

    • ADF has native connectors for Cosmos DB, Azure Blob Storage, ADLS Gen2, and Databricks, reducing setup effort.

  3. Separation of Concerns:

    • Ideal when you want to decouple orchestration from transformation logic.

    • Each stage (extraction, staging, transformation) can be managed independently.

  4. Monitoring and Alerts:

    • Offers centralized logging, retry logic, alerting, and execution history.

However, in your case, you already mentioned that:

  • Managing jobs across both ADF and Databricks introduces operational complexity.

  • Delete operations require advanced transformations that are easier in PySpark.


✅ When Databricks PySpark Notebooks Are Better

From your use case, Databricks PySpark seems more appropriate. Here's why:

  1. Unified Workflow:

    • You can write, manage, monitor, and schedule everything inside Databricks, reducing tool sprawl.

  2. Advanced Transformations:

    • Complex delete logic, joins, and conditional updates are easier in PySpark than ADF’s native data flows.

  3. Reusability & Modularity:

    • You can create parameterized scripts, define schemas dynamically, and version control them via Git integration.

  4. Dependency Management:

    • Job orchestration within Databricks (using Job Workflows or Task dependencies) is more seamless when upsert and delete steps are logically connected.

  5. Performance & Scale:

    • Databricks (especially on Photon or Delta engines) can outperform ADF’s data flows when dealing with very large volumes of data and high-throughput jobs.


šŸ”Cost, Scalability, Performance, and Setup Comparison

| Feature | ADF | Databricks |
| --- | --- | --- |
| Cost | Lower for basic ETL; can scale with IR and batch jobs | Can be higher for continuous workloads, but better optimized for Spark-heavy tasks |
| Setup | Simple for pipelines, less coding | Requires coding, but more flexible |
| Scalability | Scales well with Integration Runtime | Scales very well with Spark clusters, suited for big data |
| Performance | Slower for heavy transformations | Optimized for transformations, joins, and complex logic |
| Monitoring | GUI, detailed logs for pipelines | Unified notebook logs, cluster metrics, and job histories |

šŸ› ļø Step-by-Step Recommendation for Your Use Case

📌 Step 1: Stick to Databricks PySpark for Transformation

  • Since you’re doing incremental upserts and complex deletes, keeping both logic and orchestration within Databricks will reduce overhead.

  • Use Delta Lake tables to efficiently handle merges and deletes.

📌 Step 2: Create a Parameterized PySpark Notebook

  • Accept table_name, schema, source_path, and operation_type (upsert/delete) as parameters.

  • Implement merge logic for upserts using Delta Lake.

  • Use conditional filters and DELETE FROM for deletes, encapsulating all transformation logic (a combined sketch follows this list).
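Here is a minimal sketch of such a notebook, assuming illustrative widget defaults, an id key column, and the convention that the delete feed carries the keys to remove (all names are placeholders):

```python
# Parameterized notebook sketch: widgets choose the target table and operation.
# Runs in a Databricks notebook, where `spark` and `dbutils` are provided.
from delta.tables import DeltaTable

dbutils.widgets.text("schema", "analytics")
dbutils.widgets.text("table_name", "orders")
dbutils.widgets.text("source_path", "/mnt/landing/orders")
dbutils.widgets.dropdown("operation_type", "upsert", ["upsert", "delete"])

schema = dbutils.widgets.get("schema")
table_name = dbutils.widgets.get("table_name")
source_path = dbutils.widgets.get("source_path")
operation_type = dbutils.widgets.get("operation_type")

target = f"{schema}.{table_name}"
source_df = spark.read.parquet(source_path)

if operation_type == "upsert":
    (
        DeltaTable.forName(spark, target)
        .alias("t")
        .merge(source_df.alias("s"), "t.id = s.id")  # assumed key column
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
else:
    # Assumed convention: the delete feed contains the keys to remove.
    source_df.createOrReplaceTempView("deletes_staging")
    spark.sql(f"DELETE FROM {target} WHERE id IN (SELECT id FROM deletes_staging)")
```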

📌 Step 3: Use Databricks Jobs to Orchestrate

  • Set up Job Workflows in Databricks to control task execution order.

  • For example:

    • Task 1: Ingest new data

    • Task 2: Upsert

    • Task 3: Conditional Delete

  • Define dependencies between tasks for execution control (see the SDK sketch below).
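One way to express this chain as code is via the Databricks SDK for Python. This is a sketch only: notebook paths and the cluster id are placeholders, and compute configuration is simplified.

```python
# Sketch: defining the ingest -> upsert -> delete chain with the Databricks SDK for Python.
# Notebook paths and the cluster id are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

def notebook_task(key, path, depends_on=None, cluster_id="<existing-cluster-id>"):
    return jobs.Task(
        task_key=key,
        notebook_task=jobs.NotebookTask(notebook_path=path),
        existing_cluster_id=cluster_id,
        depends_on=[jobs.TaskDependency(task_key=d) for d in (depends_on or [])],
    )

created = w.jobs.create(
    name="cosmos-incremental-load",
    tasks=[
        notebook_task("ingest", "/Workspace/etl/ingest_from_cosmos"),
        notebook_task("upsert", "/Workspace/etl/upsert", depends_on=["ingest"]),
        notebook_task("delete", "/Workspace/etl/conditional_delete", depends_on=["upsert"]),
    ],
)
print(f"Created job {created.job_id}")
```

The same chain can equally be declared in the Jobs UI or as a YAML workflow definition; the point is that the dependency graph lives in one place inside Databricks.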

📌 Step 4: Optimize for Incremental Loads

  • Use a last_updated_timestamp or the _ts field from Cosmos DB to identify delta changes (a sketch follows this list).

  • Optionally use change feed in Cosmos DB if high-frequency updates are expected.
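A rough sketch of a watermark-based incremental read with the Azure Cosmos DB Spark 3 OLTP connector follows; endpoint, secret scope, database, and container values are placeholders, the watermark is hard-coded here rather than persisted, and option names may vary by connector version:

```python
# Sketch: incremental pull from Cosmos DB filtered on the _ts system column.
# Connection values are placeholders; a real job would persist the watermark
# (for example in a small Delta table) instead of hard-coding it.
from pyspark.sql.functions import col

last_watermark = 1735689600  # assumed: epoch seconds of the last successful run

changes_df = (
    spark.read.format("cosmos.oltp")
    .option("spark.cosmos.accountEndpoint", "https://<account>.documents.azure.com:443/")
    .option("spark.cosmos.accountKey", dbutils.secrets.get("kv-scope", "cosmos-key"))
    .option("spark.cosmos.database", "<database>")
    .option("spark.cosmos.container", "<container>")
    .option("spark.cosmos.read.inferSchema.includeSystemProperties", "true")  # expose _ts
    .load()
    .filter(col("_ts") > last_watermark)
)
```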

📌 Step 5: Consider ADF for Initial Ingestion Only (Optional)

  • If your team prefers ADF for connecting to Cosmos DB, you can extract and land data as Parquet in ADLS Gen2 via ADF.

  • From there, Databricks takes over, but remember this adds management overhead in two places (a minimal read of the landed Parquet is sketched below).
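If ADF does handle the landing step, the Databricks side of the handoff is just a read from the ADLS Gen2 path. In the sketch below, the storage account, container, and path are placeholders, and authentication (Unity Catalog external location or service principal) is assumed to be configured separately:

```python
# Sketch: picking up the Parquet extract that an ADF pipeline landed in ADLS Gen2.
# Storage account, container, and path are placeholders.
landing_path = "abfss://landing@<storage-account>.dfs.core.windows.net/cosmos/orders/"

staged_df = spark.read.parquet(landing_path)
staged_df.createOrReplaceTempView("orders_staging")  # feeds the upsert/delete tasks above
```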


āš ļøBest Practices & Pitfalls to Avoid

  • Avoid splitting logic across tools unless necessary — it adds complexity to dependency management and debugging.

  • Monitor resource costs in Databricks, especially if using large clusters. Auto-terminate and job cluster configurations can help.

  • Use Git for version control and CI/CD if you're standardizing on notebooks.

  • Leverage Unity Catalog or Table ACLs if you’re managing shared environments across teams.


✅ Final Recommendation

Based on your scenario, Databricks PySpark notebooks are better suited. You’ll benefit from:

  • Simpler architecture (no tool handoffs),

  • Greater flexibility for transformation logic,

  • Easier dependency management,

  • And better performance on large data volumes.

ADF can be useful in hybrid cases or when non-technical teams must design or manage workflows, but for your case — stick with Databricks.

Regards,
Johns Mak
