Community Platform Discussions

Choosing between Azure Data Factory (ADF) and Databricks PySpark notebooks

DebIT2011
New Contributor

I'm working on a project where I need to pull large datasets from Cosmos DB into Databricks for further processing, and I'm trying to decide whether to use Azure Data Factory (ADF) or Databricks PySpark notebooks for the extraction and processing tasks.

Use Case:

  • The data requires incremental upserts and deletes, which need to be handled separately.
  • With ADF, I would create a pipeline to extract the data from Cosmos DB, store it as Parquet in ADLS Gen2, and then transfer the file to Databricks. A Delta Live Tables (DLT) pipeline would then be triggered to create a streaming table in Databricks, and the data would be merged from the temp table into the target table.
  • However, this approach means managing and monitoring code and jobs in two places (ADF and Databricks), adding complexity.
  • On the other hand, with PySpark in Databricks, I can create reusable scripts for the upsert operation and specify the schema and table name at the job level (a rough sketch of this flow follows the list). This would keep everything within Databricks, simplifying job management.
  • Since the delete operation is complex and requires additional transformations, I prefer handling it directly in PySpark (Databricks Notebook).
  • In Databricks, managing dependencies between the upsert and delete jobs is straightforward, whereas with ADF → DLT → PySpark delete jobs, managing dependencies becomes more intricate.
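
For reference, the PySpark-only flow I have in mind looks roughly like this. It is only a sketch: it assumes the Azure Cosmos DB Spark connector is installed on the cluster, runs in a Databricks notebook (where spark and dbutils are predefined), and the endpoint, secret scope, database, container, key column, and table names are all placeholders.

```python
from delta.tables import DeltaTable

# Placeholder connector settings; the account key is read from a secret scope.
cosmos_cfg = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": dbutils.secrets.get("kv-scope", "cosmos-key"),
    "spark.cosmos.database": "<database>",
    "spark.cosmos.container": "<container>",
    "spark.cosmos.read.inferSchema.enabled": "true",
}

# Pull the source data from Cosmos DB with the Spark connector.
incoming_df = spark.read.format("cosmos.oltp").options(**cosmos_cfg).load()

# Incremental upsert: merge the extracted batch into the Delta target table
# on the document key (placeholder key column "id").
target = DeltaTable.forName(spark, "<catalog>.<schema>.<target_table>")

(
    target.alias("t")
    .merge(incoming_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```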

Based on these factors, I feel that a PySpark-based solution is more efficient, but I'd like to hear from others with experience.

Questions:

  1. What are the advantages of using ADF for this task over Databricks PySpark notebooks?
  2. Are there specific scenarios where PySpark in Databricks would be more effective for pulling and processing data from Cosmos DB?
  3. How do cost, scalability, performance and setup complexity compare between using ADF and Databricks for this use case?
  4. What best practices or pitfalls should I consider when choosing between ADF and Databricks notebooks for data extraction?

Iā€™d greatly appreciate any insights or experiences you can share!

Thanks in advance!

3 REPLIES

filipniziol
Contributor III

Hi @DebIT2011,

In my experience, consolidating both code and orchestration entirely within Databricks provides substantial benefits. By leveraging Databricks Notebooks for coding and Databricks Workflows for orchestration (potentially managed as code through YAML files), you maintain a single, unified environment. This setup simplifies everything from development to CI/CD pipelines, making ongoing maintenance far more manageable.
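
To illustrate the orchestration-as-code point, here is a minimal sketch (notebook paths, parameters, and cluster id are placeholders): the upsert-then-delete dependency expressed as a single Workflows job, created with the Databricks SDK for Python. The same structure can also be declared in asset bundle YAML.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# One job, two tasks: the delete task only runs after the upsert succeeds.
w.jobs.create(
    name="cosmos-ingest",
    tasks=[
        jobs.Task(
            task_key="upsert",
            existing_cluster_id="<cluster-id>",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Repos/etl/cosmos_upsert",
                base_parameters={"schema": "<schema>", "table": "<table>"},
            ),
        ),
        jobs.Task(
            task_key="delete",
            depends_on=[jobs.TaskDependency(task_key="upsert")],
            existing_cluster_id="<cluster-id>",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Repos/etl/cosmos_delete",
                base_parameters={"schema": "<schema>", "table": "<table>"},
            ),
        ),
    ],
)
```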

While ADF offers a low-code approach, it becomes cumbersome once you introduce more complex logic. Splitting logic between ADF and Databricks quickly leads to maintenance challenges.

Although ADF can be a decent starting point for those new to the ecosystem, in my opinion, it doesn't scale as effectively as a fully Databricks-centric approach.

Given these considerations, I would recommend keeping all logic in Databricks. This approach ensures the codebase, orchestration, and operational workflows remain in one place, improving long-term scalability and maintainability.

brycejune
New Contributor III

Hi @DebIT2011,

Hope you're doing well. For incremental upserts and deletes from Cosmos DB, Databricks PySpark offers simplicity and unified management, especially for complex transformations and dependency handling. ADF may excel in GUI-based orchestration and integration scenarios, but it adds complexity by splitting the process across two tools. Consider Databricks for scalability and PySpark scripts for flexibility, while ADF could reduce setup effort for simpler workflows.

Hope this helps!

Regards,
Bryce June

BlankRichards
New Contributor II

Hi @DebIT2011,

Deciding between Azure Data Factory (ADF) and Databricks PySpark Notebooks for data extraction and processing depends on several factors specific to your use case. Let's address each aspect raised:

Advantages of ADF over Databricks PySpark Notebooks

  1. Low-Code Interface: ADF offers a user-friendly graphical interface, making it easier to create and manage pipelines without extensive coding. Ideal for teams with limited programming expertise.
  2. Native Automation and Connectors: ADF has native connectors for Cosmos DB and ADLS Gen2, simplifying integration and scaling.
  3. Centralized Monitoring: ADF provides detailed pipeline monitoring directly in the Azure Portal, facilitating traceability and failure management.
  4. Separation of Concerns: Using ADF for extraction and Databricks for processing ensures a clear division of responsibilities between specialized tools.

Scenarios Where PySpark in Databricks is More Efficient

  1. Complex Transformation Operations: As in your case, where deletes and upserts involve complex logic, PySpark allows flexibility and direct control over data and operations (see the sketch after this list).
  2. Streamlined Workflows: Keeping everything within Databricks reduces the complexity of managing two separate environments (ADF and Databricks).
  3. Script Reusability: PySpark makes it easy to modularize and reuse scripts, improving development efficiency.
  4. Dependency Management: In Databricks, managing dependencies between tasks is more straightforward, particularly for sequential operations like upserts and deletes.
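
To make the delete point concrete, here is a rough, hypothetical sketch. It assumes the deleted keys arrive as their own dataset (for example, from a separate extraction of tombstone records), and that it runs in a Databricks notebook where spark is predefined; table names and the key column are placeholders.

```python
from delta.tables import DeltaTable

# Hypothetical dataset of keys to delete, produced by an upstream extraction step.
deleted_keys_df = spark.read.table("<catalog>.<schema>.<deleted_keys_table>")

target = DeltaTable.forName(spark, "<catalog>.<schema>.<target_table>")

# Remove target rows whose key appears in the delete set; all other rows are untouched.
(
    target.alias("t")
    .merge(deleted_keys_df.alias("d"), "t.id = d.id")
    .whenMatchedDelete()
    .execute()
)
```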

Comparison of Cost, Scalability, Performance, and Complexity

  1. Cost:
    • ADF is generally more cost-effective for simple pipelines and low-volume ETL processes.
    • Databricks can be more expensive, especially for heavy workloads, due to computational resource consumption on clusters.
  2. Scalability:
    • ADF is highly scalable for data integration across multiple sources.
    • Databricks is better suited for intensive processing and horizontal scaling of complex transformations.
  3. Performance:
    • ADF may have performance limitations for highly customized operations.
    • PySpark in Databricks delivers better performance for large datasets and advanced logic.
  4. Setup and Complexity:
    • ADF has a simpler and faster initial setup for ETL pipelines.
    • Databricks requires more initial effort for configuration, but consolidating everything in a single environment reduces long-term complexity.

Best Practices and Pitfalls to Avoid

  1. Consider Data Volume: For large datasets, Databricks is better for performance and parallelism.
  2. Manage Dependencies: If choosing ADF, carefully plan pipeline chaining to avoid failures or delays.
  3. Modularize and Document: PySpark scripts should be well modularized for easier maintenance.
  4. Monitor Costs: Use Azure metrics and alerts to avoid unexpected costs in Databricks or ADF.
  5. Test and Iterate: Evaluate both approaches in a test environment to validate performance and cost for your specific use case.

Based on the details provided, PySpark in Databricks seems more aligned with your use case, given the focus on complex operations and the need to simplify management. However, if your team values a low-code interface or wants to minimize initial setup efforts, ADF can still be a viable choice.

Regards,
Blank Richards
