Efficiently orchestrate Databricks jobs

Phani1
Valued Contributor

Hi Team,

How can we efficiently orchestrate Databricks jobs that involve a lot of transformations, dependencies, and complexity?

At the source, we have a lot of SSIS packages with complex dependencies and many transformations.

We have the following options.

1) Implement the logic in Databricks notebooks and schedule it using Databricks Jobs/Workflows

2) Implement the logic in dbt and schedule it with dbt

3) Implement the logic in Databricks notebooks and schedule it using ADF

Could you please suggest the best way to implement this, keeping easy re-runs and cost savings in mind? Please share any reference docs/links.

3 REPLIES

Kaniz
Community Manager

Hi @Janga Reddy, Databricks provides several tools to efficiently orchestrate complex jobs involving many transformations and dependencies.

Here are some suggestions for how to implement a cost-saving and easily re-runnable orchestration:

  1. Use Databricks Jobs: Databricks Jobs lets you schedule and run notebooks, scripts, and JARs on a recurring basis. By using Jobs, you can automate the execution of your complex pipelines so that they run consistently and efficiently, and you can define dependencies between tasks so that they run in the correct order (a rough job-definition sketch follows this list).
  2. Use Databricks Delta Lake: Delta Lake is an open-source storage layer that brings reliability to data lakes. It provides ACID transactions, scalable metadata handling, and unified batch and streaming processing. By using Delta Lake, you can ensure that your data stays consistent and available for analysis even when jobs are re-run (see the MERGE sketch below the list).
  3. Use Databricks Notebooks: Notebooks are a great way to develop and test your transformations. Using notebooks lets you quickly prototype new ideas, collaborate with others, and debug issues. Once you have finalized your transformations, you can move them to a production environment, such as a Databricks Job.
  4. Use Databricks APIs: Databricks APIs provide programmatic access to Databricks services. You can automate the deployment of your jobs and transformations using APIs and monitor their execution.
  5. Use Databricks AutoML: Databricks AutoML allows you to automate the machine learning pipeline, from data preparation to model deployment. By using AutoML, you can save time and reduce the cost of developing and deploying machine learning models.
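
As a minimal sketch of points 1 and 4 in practice (not a drop-in solution), here is a multi-task job with dependencies created through the Jobs API 2.1 from Python. The workspace URL, token, job name, notebook paths, and cluster settings are all placeholders you would replace with your own:

    import requests

    HOST = "https://<your-workspace-url>"      # placeholder workspace URL
    TOKEN = "<personal-access-token>"          # placeholder token

    # Three notebook tasks chained with depends_on; names and paths are made up.
    job_spec = {
        "name": "ssis_migration_pipeline",
        "job_clusters": [{
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",            # example runtime
                "node_type_id": "Standard_DS3_v2",              # example node type
                "autoscale": {"min_workers": 2, "max_workers": 8},
            },
        }],
        "tasks": [
            {"task_key": "extract",
             "notebook_task": {"notebook_path": "/Pipelines/extract"},
             "job_cluster_key": "shared_cluster"},
            {"task_key": "transform",
             "depends_on": [{"task_key": "extract"}],
             "notebook_task": {"notebook_path": "/Pipelines/transform"},
             "job_cluster_key": "shared_cluster"},
            {"task_key": "load_gold",
             "depends_on": [{"task_key": "transform"}],
             "notebook_task": {"notebook_path": "/Pipelines/load_gold"},
             "job_cluster_key": "shared_cluster"},
        ],
    }

    # Create the job; the same spec can also be managed in the Workflows UI.
    resp = requests.post(f"{HOST}/api/2.1/jobs/create",
                         headers={"Authorization": f"Bearer {TOKEN}"},
                         json=job_spec)
    resp.raise_for_status()
    print("Created job:", resp.json()["job_id"])

If a downstream task fails, you can repair the run from the failed task instead of re-running the whole pipeline, which helps with both re-runnability and cost.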

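On point 2, one common pattern for making re-runs safe is to write into the silver table with MERGE rather than blind appends, so a retried job does not duplicate rows. A minimal sketch, assuming a made-up silver.orders table keyed by order_id and a placeholder landing path:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    # Placeholder source path and table/column names, for illustration only.
    updates_df = spark.read.parquet("/mnt/landing/orders/")

    silver = DeltaTable.forName(spark, "silver.orders")

    # Upsert: update existing order_ids, insert new ones, so re-runs are idempotent.
    (silver.alias("t")
           .merge(updates_df.alias("s"), "t.order_id = s.order_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
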
Regarding cost-saving and re-runnability, use an efficient data processing framework such as Apache Spark™ that can scale horizontally to handle large data volumes. Additionally, it is best to keep your data in cloud storage such as Amazon S3 or Azure Blob Storage, which provides low-cost, highly scalable storage. Finally, it's essential to monitor your jobs for performance and cost, for example through the run history of Databricks Jobs, and to tune your cluster size and configuration accordingly (a short monitoring sketch follows).
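
As a rough sketch of that monitoring step, you can pull recent run durations for a job from the Jobs API (runs/list) and watch for runs that are getting slower or more expensive; HOST, TOKEN, and JOB_ID below are placeholders:

    import requests

    HOST = "https://<your-workspace-url>"     # placeholder workspace URL
    TOKEN = "<personal-access-token>"         # placeholder token
    JOB_ID = 123                              # placeholder job id

    # List recent runs for the job and print their outcome and duration.
    resp = requests.get(f"{HOST}/api/2.1/jobs/runs/list",
                        headers={"Authorization": f"Bearer {TOKEN}"},
                        params={"job_id": JOB_ID, "limit": 25})
    resp.raise_for_status()

    for run in resp.json().get("runs", []):
        # execution_duration is reported in milliseconds
        print(run["run_id"],
              run["state"].get("result_state"),
              run.get("execution_duration"))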

Anonymous
Not applicable

Hi @Janga Reddy​ 

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!

Phani1
Valued Contributor

My question is, how do we reliably orchestrate multiple Databricks Jobs/Workflows that run at mixed latencies and can write to the same silver and gold Delta tables? Could you please suggest the best approach and practices for this?
