<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Solution Design for an ingestion workflow with 1000s of tables for each source in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/solution-design-for-an-ingestion-workflow-with-1000s-of-tables/m-p/106273#M42433</link>
    <description>&lt;P&gt;We are working on an ingestion workflow in Databricks that extracts data from on-prem sources, following the standard practices of incremental load, idempotency, upsert, schema evolution, etc., and storing the data properly.&lt;/P&gt;&lt;P&gt;Now we want to optimize the workflow for scale, where a single source can have thousands of tables.&lt;/P&gt;&lt;P&gt;We are using the For Each feature of Databricks Workflows to run 100 ingestion load tasks in parallel. The challenge arises when we have more than 100 tables, because the maximum number of concurrent tasks is 100.&lt;/P&gt;&lt;P&gt;What should be the approach to handle this?&lt;BR /&gt;Some options we thought about:&lt;/P&gt;&lt;P&gt;Our metadata table has a priority column. We can use it to group the tables so that each workflow handles a particular priority and fetches table info accordingly.&lt;/P&gt;&lt;P&gt;To run these multi-priority workflows, we can think of two approaches: either create a separate workflow for each priority and run them independently, or&lt;/P&gt;&lt;P&gt;create a master workflow that gathers tables by priority, orchestrates them, and passes them to child workflows where further filtering happens; each child then passes its list to a For Each task where the incremental ingestion runs.&lt;/P&gt;&lt;P&gt;We also came to know about concurrent workflow runs.&lt;BR /&gt;&lt;BR /&gt;What will be the best approach to move forward, considering we will be working at massive scale? Parallelization is the preferred approach, but we need to weigh its pros and cons and decide how we will handle the compute cluster configuration, etc.&lt;/P&gt;</description>
    <pubDate>Mon, 20 Jan 2025 06:58:48 GMT</pubDate>
    <dc:creator>ashraf1395</dc:creator>
    <dc:date>2025-01-20T06:58:48Z</dc:date>
    <item>
      <title>Solution Design for an ingestion workflow with 1000s of tables for each source</title>
      <link>https://community.databricks.com/t5/data-engineering/solution-design-for-an-ingestion-workflow-with-1000s-of-tables/m-p/106273#M42433</link>
      <description>&lt;P&gt;We are working on an ingestion workflow in Databricks that extracts data from on-prem sources, following the standard practices of incremental load, idempotency, upsert, schema evolution, etc., and storing the data properly.&lt;/P&gt;&lt;P&gt;Now we want to optimize the workflow for scale, where a single source can have thousands of tables.&lt;/P&gt;&lt;P&gt;We are using the For Each feature of Databricks Workflows to run 100 ingestion load tasks in parallel. The challenge arises when we have more than 100 tables, because the maximum number of concurrent tasks is 100.&lt;/P&gt;&lt;P&gt;What should be the approach to handle this?&lt;BR /&gt;Some options we thought about:&lt;/P&gt;&lt;P&gt;Our metadata table has a priority column. We can use it to group the tables so that each workflow handles a particular priority and fetches table info accordingly.&lt;/P&gt;&lt;P&gt;To run these multi-priority workflows, we can think of two approaches: either create a separate workflow for each priority and run them independently, or&lt;/P&gt;&lt;P&gt;create a master workflow that gathers tables by priority, orchestrates them, and passes them to child workflows where further filtering happens; each child then passes its list to a For Each task where the incremental ingestion runs.&lt;/P&gt;&lt;P&gt;We also came to know about concurrent workflow runs.&lt;BR /&gt;&lt;BR /&gt;What will be the best approach to move forward, considering we will be working at massive scale? Parallelization is the preferred approach, but we need to weigh its pros and cons and decide how we will handle the compute cluster configuration, etc.&lt;/P&gt;</description>
      <pubDate>Mon, 20 Jan 2025 06:58:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solution-design-for-an-ingestion-workflow-with-1000s-of-tables/m-p/106273#M42433</guid>
      <dc:creator>ashraf1395</dc:creator>
      <dc:date>2025-01-20T06:58:48Z</dc:date>
    </item>
    <item>
      <title>Re: Solution Design for an ingestion workflow with 1000s of tables for each source</title>
      <link>https://community.databricks.com/t5/data-engineering/solution-design-for-an-ingestion-workflow-with-1000s-of-tables/m-p/106298#M42439</link>
      <description>&lt;P&gt;I did similar work in a recent project, where I needed to run many SQL DDLs, so I automated the process using Databricks Jobs: capturing the dependencies in a metadata table and creating the corresponding tasks through the Jobs API. Doing so reduced the runtime and resource consumption significantly, by 33%.&lt;/P&gt;&lt;P&gt;So, instead of creating a separate workflow/job for each priority, which would result in too many jobs and is not recommended, the best approach is to create a master workflow/job that gathers tables by priority, orchestrates them, and passes them to child workflows where further filtering happens; each child then passes its list to a For Each task where the incremental ingestion runs. Also make sure to provision the job cluster with the required compute and memory, so that runs don't take too long.&lt;/P&gt;</description>
      <pubDate>Mon, 20 Jan 2025 11:42:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/solution-design-for-an-ingestion-workflow-with-1000s-of-tables/m-p/106298#M42439</guid>
      <dc:creator>Avinash_Narala</dc:creator>
      <dc:date>2025-01-20T11:42:47Z</dc:date>
    </item>
  </channel>
</rss>

