Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Solution Design for an ingestion workflow with 1000s of tables for each source

ashraf1395
Valued Contributor

We are working on an ingestion workflow in Databricks that extracts data from on-prem sources, following standard practices such as incremental loads, idempotency, upserts, and schema evolution, and stores the data properly.
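For context, each per-table ingestion task ends in an upsert roughly like the sketch below (a minimal sketch only; the table name, key columns and the incremental batch are placeholders for our metadata-driven logic):

```python
# Minimal sketch of the per-table incremental upsert (Delta Lake MERGE).
# Table names, key columns and the incremental batch are placeholders;
# `spark` is the SparkSession provided by the Databricks runtime.
from delta.tables import DeltaTable

# Allow new source columns to be added to the target during MERGE (schema evolution).
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

def upsert_incremental(batch_df, target_table: str, key_cols: list[str]) -> None:
    """Idempotently merge an incremental batch into the Delta target."""
    condition = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    (DeltaTable.forName(spark, target_table)
        .alias("t")
        .merge(batch_df.alias("s"), condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```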

Now we want to optimize the workflow for scale, where a single source can have more than a thousand tables.

We are using the For Each feature of Databricks Workflows to run up to 100 ingestion load tasks in parallel. The challenge arises when a source has more than 100 tables, since the maximum number of concurrent tasks in a For Each task is 100.
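For reference, the list that feeds the For Each task is built by a small upstream task, roughly like this (a sketch; the metadata table name, filter and task-value key are placeholders):

```python
# Sketch of the "build table list" task that feeds the For Each task.
# The metadata table name, filter and task-value key are placeholders.
import json

tables = [
    r["table_name"]
    for r in (
        spark.table("ops.ingestion_metadata")
        .where("source = 'source_a' AND is_active = true")
        .select("table_name")
        .collect()
    )
]

# The For Each task references this as {{tasks.build_table_list.values.tables}}
# and fans out one ingestion task per table; its `concurrency` is capped at 100.
dbutils.jobs.taskValues.set(key="tables", value=json.dumps(tables))
```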

What should be the approach to handle this?
Some options we have thought about:

Our metadata table has a priority column. We can use it to group the tables so that each workflow handles a particular priority and fetches its table list accordingly.

To run these per-priority workflows we can think of two approaches: either create a separate workflow for each priority and run them independently, or

create a master workflow that captures tables according to priority, orchestrates them, and passes them to child workflows, where further filtering happens and the list is passed into a For Each task that performs the incremental ingestion.
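A minimal sketch of what that master workflow's orchestration step could look like with the Databricks Python SDK, assuming the child job defines job-level parameters `source` and `priority` (the job ID, source name and metadata table are placeholders):

```python
# Sketch of the master job's orchestration step: trigger one run of the
# child ingestion job per priority group. Job ID and names are placeholders.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
CHILD_JOB_ID = 123456789  # child job containing the For Each ingestion task

priorities = sorted(
    r["priority"]
    for r in spark.sql(
        "SELECT DISTINCT priority FROM ops.ingestion_metadata WHERE source = 'source_a'"
    ).collect()
)

# The child job's max_concurrent_runs must allow this many runs in parallel.
handles = [
    w.jobs.run_now(
        job_id=CHILD_JOB_ID,
        job_parameters={"source": "source_a", "priority": str(p)},
    )
    for p in priorities
]

# Optionally block until every child run has finished:
# for h in handles:
#     h.result()
```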

We also came to know about concurrent workflow runs (the max_concurrent_runs job setting).

What would be the best approach to move forward, considering we will be working at massive scale? Parallelization is the preferred approach, but we want to weigh its pros and cons and understand how to handle the compute cluster configuration.

1 REPLY

Avinash_Narala
Valued Contributor II

I did similar work in a recent project, where I needed to run many SQL DDLs. I automated the process using Databricks Jobs: capturing the dependencies in a metadata table and creating the tasks accordingly through the Jobs API. Doing so reduced runtime and resource consumption by 33%.

So, instead of creating a separate workflow/job for each priority (which would result in too many jobs and is not recommended), the best approach is a master workflow/job that captures tables according to priority, orchestrates them, and passes them to child workflows, where further filtering happens and the list is passed into a For Each task that performs the incremental ingestion. Also make sure to provision the job cluster with the required compute and memory so that it doesn't take too long to run.
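As a rough illustration of that pattern, here is a sketch with the Databricks Python SDK that builds a job's tasks and their depends_on wiring from metadata rows; the task keys, notebook path, cluster spec and job name are just placeholders:

```python
# Sketch: create a job whose tasks and dependencies come from a metadata table.
# Task keys, notebook path, cluster spec and job name are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

# (task_key, depends_on_keys) pairs, e.g. read from the metadata table.
metadata = [
    ("load_customers", []),
    ("load_orders", ["load_customers"]),
]

tasks = [
    jobs.Task(
        task_key=key,
        notebook_task=jobs.NotebookTask(
            notebook_path="/Repos/etl/ingest_table",
            base_parameters={"table": key},
        ),
        depends_on=[jobs.TaskDependency(task_key=d) for d in deps],
        job_cluster_key="ingest_cluster",
    )
    for key, deps in metadata
]

created = w.jobs.create(
    name="metadata_driven_ingestion",
    tasks=tasks,
    job_clusters=[
        jobs.JobCluster(
            job_cluster_key="ingest_cluster",
            new_cluster=compute.ClusterSpec(
                spark_version="15.4.x-scala2.12",
                node_type_id="Standard_DS3_v2",
                num_workers=4,
            ),
        )
    ],
)
print(f"Created job {created.job_id}")
```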
