We are working on an ingestion workflow in Databricks that extracts data from on-prem sources, following the standard practices of incremental load, idempotency, upserts, schema evolution, etc., and storing the data properly.
Now we want to optimize the workflow for scale, where we have more than 1,000 tables. Let's say for one source for now.
We are using the For Each feature of Databricks Workflows to run 100 ingestion load tasks in parallel. The challenge comes when we have more than 100 tables, since the maximum number of concurrent tasks in a For Each loop is 100.
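If the concurrency cap is the only blocker, one simple pattern is to batch the table list before it ever reaches the For Each task, so each run stays within the limit. A minimal sketch in plain Python (the batch size of 100 just mirrors the current cap):

```python
# Split the full table list into batches no larger than the For Each
# concurrency cap, so each batch can be dispatched as one For Each run.
def chunk(items, size=100):
    for i in range(0, len(items), size):
        yield items[i:i + size]

batches = list(chunk([f"table_{n}" for n in range(250)]))
# -> 3 batches: 100 + 100 + 50 tables
```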
What should be the approach to handle this?
Some options we have thought about:
Our metadata table has a priority column. We can use it to group the tables so that each workflow handles a particular priority and fetches its table info accordingly.
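As a rough sketch of the "list builder" notebook task that would feed the For Each loop, assuming a metadata table named ops.ingestion_metadata with columns table_name, priority, and is_active (all hypothetical names; spark and dbutils are the objects predefined in a Databricks notebook):

```python
# Notebook task that builds the For Each input for one priority bucket.
# Assumed schema: ops.ingestion_metadata(table_name, priority, is_active).
priority = int(dbutils.widgets.get("priority"))  # passed in as a job parameter

tables = [
    r.table_name
    for r in (
        spark.table("ops.ingestion_metadata")
        .where(f"is_active = true AND priority = {priority}")
        .select("table_name")
        .collect()
    )
]

# Expose the list to the downstream For Each task, which can reference it
# in its "inputs" field as {{tasks.build_table_list.values.tables}}.
dbutils.jobs.taskValues.set(key="tables", value=tables)
```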
To run these per-priority workflows, we can think of two approaches: either create a separate workflow for each priority and run them independently, or
create a master workflow that captures the tables by priority, orchestrates them, and passes them to child workflows, where further filtering happens and the list is fed into a For Each task that performs the incremental ingestion.
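For the master-workflow option, one way to trigger the child workflows is via the Databricks SDK for Python, roughly like this (the job IDs and the priority parameter name are placeholders):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Hypothetical mapping of priority bucket -> child ingestion job ID.
CHILD_JOBS = {1: 1111, 2: 2222, 3: 3333}

# Kick off all child jobs; each child filters the metadata table by its
# priority and fans the resulting table list out to its own For Each task.
waiters = [
    w.jobs.run_now(job_id=job_id, job_parameters={"priority": str(p)})
    for p, job_id in CHILD_JOBS.items()
]

# Optionally block until every child run completes (fails fast on errors).
results = [waiter.result() for waiter in waiters]
```

One nice property of this split is that each child job can carry its own cluster configuration, so high-priority buckets can get larger or warmer compute than low-priority ones.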
We also came to know about concurrent workflow runs.
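As we understand it, concurrent runs of a single job are governed by its max_concurrent_runs setting, which defaults to 1. A sketch of raising it through the SDK (the job ID is a placeholder):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import JobSettings

w = WorkspaceClient()

# Raising max_concurrent_runs lets the same job definition run once per
# priority bucket at the same time instead of queueing behind itself.
w.jobs.update(job_id=1234, new_settings=JobSettings(max_concurrent_runs=5))
```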
What would be the best approach to move forward, considering we will be working at massive scale? Parallelization is the preferred approach, but we want to weigh all its pros and cons, including how we will handle the compute cluster configuration, etc.