
Choosing between Azure Data Factory (ADF) and Databricks PySpark notebooks

DebIT2011
New Contributor

I'm working on a project where I need to pull large datasets from Cosmos DB into Databricks for further processing, and I'm trying to decide whether to use Azure Data Factory (ADF) or Databricks PySpark notebooks for the extraction and processing tasks.
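
For context, reading straight from Cosmos DB in a notebook looks roughly like this: a minimal sketch, assuming the Azure Cosmos DB Spark connector (azure-cosmos-spark) is installed on the cluster. The endpoint, secret scope, and database/container names are placeholders.

```python
# Minimal sketch: load a Cosmos DB container into a DataFrame with the
# Azure Cosmos DB Spark connector. All names below are placeholders;
# `spark` and `dbutils` are predefined in Databricks notebooks.
cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": dbutils.secrets.get("my-scope", "cosmos-key"),
    "spark.cosmos.database": "<database>",
    "spark.cosmos.container": "<container>",
    "spark.cosmos.read.inferSchema.enabled": "true",
}

df = spark.read.format("cosmos.oltp").options(**cosmos_config).load()
```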

Use Case:

  • The data requires incremental upserts and deletes, which need to be handled separately.
  • With ADF, I would create a pipeline to extract the data from Cosmos DB, store it as Parquet in ADLS Gen2, and then hand the files off to Databricks, where a Delta Live Tables (DLT) pipeline would be triggered to create a streaming table (data would be merged from the temp table into the target table; a minimal DLT sketch follows this list).
  • However, this approach means managing and monitoring code and jobs in two places (ADF and Databricks), adding complexity.
  • On the other hand, with PySpark in Databricks, I can create reusable scripts for the upsert operation and specify the schema and table name at the job level. This would keep everything within Databricks, simplifying job management (see the PySpark sketch after this list).
  • Since the delete operation is complex and requires additional transformations, I prefer handling it directly in PySpark (a Databricks notebook).
  • In Databricks, managing dependencies between the upsert and delete jobs is straightforward, whereas with ADF → DLT → PySpark delete jobs, dependency management becomes more intricate.
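
For the ADF option, the DLT side could look something like this minimal sketch, assuming the Parquet files land in an ADLS Gen2 path and carry a key, a sequence column, and a delete flag. The path, key, and column names are placeholder assumptions.

```python
import dlt
from pyspark.sql import functions as F

# Auto Loader reads the Parquet files that ADF lands in ADLS Gen2,
# and apply_changes() merges them into the streaming target table.
@dlt.view
def cosmos_updates():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("abfss://landing@<storage>.dfs.core.windows.net/cosmos/")
    )

dlt.create_streaming_table("cosmos_target")

dlt.apply_changes(
    target="cosmos_target",
    source="cosmos_updates",
    keys=["id"],                                      # placeholder key
    sequence_by=F.col("_ts"),                         # Cosmos change timestamp
    apply_as_deletes=F.expr("operation = 'delete'"),  # assumes a delete flag
)
```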
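
And for the all-Databricks option, a minimal sketch of a parameterized notebook: schema and table name come in as job-level widgets, the upsert is a Delta MERGE, and the delete runs as a separate, dependent step. The widget names, staging tables, and key column are placeholder assumptions.

```python
from delta.tables import DeltaTable

# Schema and table name supplied at the job level, as described above.
dbutils.widgets.text("target_schema", "bronze")
dbutils.widgets.text("target_table", "cosmos_events")
target = f"{dbutils.widgets.get('target_schema')}.{dbutils.widgets.get('target_table')}"

# Staged extract from Cosmos DB (placeholder table name).
updates_df = spark.read.table("temp_cosmos_extract")

# Upsert: merge the staged rows into the target on the business key.
(
    DeltaTable.forName(spark, target)
    .alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Delete step (run as a dependent task): remove rows flagged as deleted
# upstream; here the flags are assumed to sit in a separate staging table.
deletes_df = spark.read.table("temp_cosmos_deletes")
deletes_df.createOrReplaceTempView("pending_deletes")
spark.sql(f"DELETE FROM {target} WHERE id IN (SELECT id FROM pending_deletes)")
```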

Based on these factors, I feel that a PySpark-based solution is more efficient, but I'd like to hear from others with experience.

Questions:

  1. What are the advantages of using ADF for this task over Databricks PySpark notebooks?
  2. Are there specific scenarios where PySpark in Databricks would be more effective for pulling and processing data from Cosmos DB?
  3. How do cost, scalability, performance and setup complexity compare between using ADF and Databricks for this use case?
  4. What best practices or pitfalls should I consider when choosing between ADF and Databricks notebooks for data extraction?

I'd greatly appreciate any insights or experiences you can share!

Thanks in advance!

2 REPLIES

filipniziol
Contributor III

Hi @DebIT2011,

In my experience, consolidating both code and orchestration entirely within Databricks provides substantial benefits. By leveraging Databricks Notebooks for coding and Databricks Workflows for orchestration (potentially managed as code through YAML files), you maintain a single, unified environment. This setup simplifies everything from development to CI/CD pipelines, making ongoing maintenance far more manageable.
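
To illustrate the workflows-as-code point, here is a minimal Databricks Asset Bundle sketch that chains an upsert task and a dependent delete task; the bundle name, notebook paths, and parameters are all hypothetical.

```yaml
# databricks.yml -- minimal Asset Bundle sketch; all names are hypothetical.
bundle:
  name: cosmos_ingest

resources:
  jobs:
    cosmos_ingest_job:
      name: cosmos-ingest-job
      tasks:
        # Parameterized upsert notebook (schema/table passed at the job level).
        - task_key: upsert
          notebook_task:
            notebook_path: ./notebooks/upsert_cosmos.py
            base_parameters:
              target_schema: bronze
              target_table: cosmos_events
        # Delete runs only after the upsert succeeds.
        - task_key: delete
          depends_on:
            - task_key: upsert
          notebook_task:
            notebook_path: ./notebooks/delete_cosmos.py
```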

While ADF offers a low-code approach, it becomes cumbersome once you introduce more complex logic. Splitting logic between ADF and Databricks quickly leads to maintenance challenges.

Although ADF can be a decent starting point for those new to the ecosystem, in my opinion, it doesn't scale as effectively as a fully Databricks-centric approach.

Given these considerations, I would recommend keeping all logic in Databricks. This approach ensures the codebase, orchestration, and operational workflows remain in one place, improving long-term scalability and maintainability.

brycejune
New Contributor III

Hi @DebIT2011,

Hope you're doing well. For incremental upserts and deletes from Cosmos DB, Databricks PySpark offers simplicity and unified management, especially for complex transformations and dependency handling. ADF may excel at GUI-based orchestration and integration scenarios, but it adds complexity by splitting the process across two tools. Consider Databricks for scalability and PySpark scripts for flexibility, while ADF could reduce setup effort for simpler workflows.

Hope this works for you!

Regards,
Bryce June
