Efficiently orchestrate Databricks jobs

Phani1
Valued Contributor

Hi Team,

How can we efficiently orchestrate Databricks jobs that involve a lot of transformations, dependencies, and complexity?

At the source, we have a lot of SSIS packages with complex dependencies and many transformations.

We have the following options.

1) Implement the logic in Databricks notebooks and schedule it using Databricks Jobs/Workflows

2) Implement the logic in dbt and schedule it with dbt

3) Implement the logic in Databricks notebooks and schedule it using ADF

Could you please suggest the best way to implement this, keeping easy re-runs and cost savings in mind? Please share any reference docs/links.

3 REPLIES

Kaniz
Community Manager

Hi @Janga Reddy, Databricks provides several tools to efficiently orchestrate complex jobs involving many transformations and dependencies.

Here are some suggestions for how to implement a cost-saving and easily re-runnable orchestration:

  1. Use Databricks Jobs: Databricks Jobs lets you schedule and run notebooks, scripts, and JARs on a recurring basis. By using Jobs, you can automate the execution of your complex pipelines so that they run consistently and efficiently, and you can define dependencies between tasks so that they run in the correct order (a rough job-definition sketch follows this list).
  2. Use Databricks Delta Lake: Delta Lake is an open-source storage layer that brings reliability to data lakes. It provides ACID transactions, scalable metadata handling, and unified batch and streaming processing. By using Delta Lake, you can ensure that your data stays consistent and available for analysis even when jobs are re-run (see the MERGE sketch below the list).
  3. Use Databricks Notebooks: Notebooks are a great way to develop and test your transformations. Using notebooks lets you quickly prototype new ideas, collaborate with others, and debug issues. Once you have finalized your transformations, you can move them to a production environment, such as a Databricks Job.
  4. Use Databricks APIs: Databricks APIs provide programmatic access to Databricks services. You can automate the deployment of your jobs and transformations using APIs and monitor their execution.
  5. Use Databricks AutoML: Databricks AutoML allows you to automate the machine learning pipeline, from data preparation to model deployment. By using AutoML, you can save time and reduce the cost of developing and deploying machine learning models.
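
As a minimal sketch of points 1 and 4 in practice (not a drop-in solution), here is a multi-task job with dependencies created through the Jobs API 2.1 from Python. The workspace URL, token, job name, notebook paths, and cluster settings are all placeholders you would replace with your own:

    import requests

    HOST = "https://<your-workspace-url>"      # placeholder workspace URL
    TOKEN = "<personal-access-token>"          # placeholder token

    # Three notebook tasks chained with depends_on; names and paths are made up.
    job_spec = {
        "name": "ssis_migration_pipeline",
        "job_clusters": [{
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",            # example runtime
                "node_type_id": "Standard_DS3_v2",              # example node type
                "autoscale": {"min_workers": 2, "max_workers": 8},
            },
        }],
        "tasks": [
            {"task_key": "extract",
             "notebook_task": {"notebook_path": "/Pipelines/extract"},
             "job_cluster_key": "shared_cluster"},
            {"task_key": "transform",
             "depends_on": [{"task_key": "extract"}],
             "notebook_task": {"notebook_path": "/Pipelines/transform"},
             "job_cluster_key": "shared_cluster"},
            {"task_key": "load_gold",
             "depends_on": [{"task_key": "transform"}],
             "notebook_task": {"notebook_path": "/Pipelines/load_gold"},
             "job_cluster_key": "shared_cluster"},
        ],
    }

    # Create the job; the same spec can also be managed in the Workflows UI.
    resp = requests.post(f"{HOST}/api/2.1/jobs/create",
                         headers={"Authorization": f"Bearer {TOKEN}"},
                         json=job_spec)
    resp.raise_for_status()
    print("Created job:", resp.json()["job_id"])

If a downstream task fails, you can repair the run from the failed task instead of re-running the whole pipeline, which helps with both re-runnability and cost.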

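On point 2, one common pattern for making re-runs safe is to write into the silver table with MERGE rather than blind appends, so a retried job does not duplicate rows. A minimal sketch, assuming a made-up silver.orders table keyed by order_id and a placeholder landing path:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    # Placeholder source path and table/column names, for illustration only.
    updates_df = spark.read.parquet("/mnt/landing/orders/")

    silver = DeltaTable.forName(spark, "silver.orders")

    # Upsert: update existing order_ids, insert new ones, so re-runs are idempotent.
    (silver.alias("t")
           .merge(updates_df.alias("s"), "t.order_id = s.order_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
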
Regarding cost-saving and re-runnability, use an efficient data processing framework such as Apache Spark™ that can scale horizontally to handle large data volumes. Additionally, it is best to keep your data in cloud storage such as Amazon S3 or Azure Blob Storage, which provides low-cost, highly scalable storage. Finally, it's essential to monitor your jobs for performance and cost, for example through the run history of Databricks Jobs, and to tune your cluster size and configuration accordingly (a short monitoring sketch follows).
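
As a rough sketch of that monitoring step, you can pull recent run durations for a job from the Jobs API (runs/list) and watch for runs that are getting slower or more expensive; HOST, TOKEN, and JOB_ID below are placeholders:

    import requests

    HOST = "https://<your-workspace-url>"     # placeholder workspace URL
    TOKEN = "<personal-access-token>"         # placeholder token
    JOB_ID = 123                              # placeholder job id

    # List recent runs for the job and print their outcome and duration.
    resp = requests.get(f"{HOST}/api/2.1/jobs/runs/list",
                        headers={"Authorization": f"Bearer {TOKEN}"},
                        params={"job_id": JOB_ID, "limit": 25})
    resp.raise_for_status()

    for run in resp.json().get("runs", []):
        # execution_duration is reported in milliseconds
        print(run["run_id"],
              run["state"].get("result_state"),
              run.get("execution_duration"))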

Anonymous
Not applicable

Hi @Janga Reddy​ 

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!

Phani1
Valued Contributor

My question is, how do we reliably orchestrate multiple Databricks Jobs/Workflows that run at mixed latencies and can write to the same silver and gold Delta tables? Could you please suggest the best approach and practices for this?
