Databricks Community

Andolina · ‎10-25-2024

Hello All,

I am trying to fetch data from different sources for tables driven by a metadata table. For each table listed in the metadata table, the data is fetched from its source using a JDBC connector. A scheduled job is responsible for fetching the data for each table. Now, with a huge number of new tables, I want a faster and more efficient way of ingesting data using parallel processing. I tried the Maximum concurrent runs setting in the workflow and expected 6 parallel runs when I set concurrent runs = 6, but it shows only one run. Does this happen at the executor level? What is the expected outcome of the Maximum concurrent runs option?

elguitar · ‎10-30-2024

So, you use a loop to go through the metadata table and then retrieve and ingest the data using JDBC?

If so, then concurrent runs won't be helpful. Maximum concurrent runs sets how many runs of that same job can run side by side. For you, this would probably mean ingesting the same data 6 times if you were to run the job 6 times.

If you want to retrieve and ingest those tables concurrently, you can either:

1. Split the processing of individual tables into separate tasks of the job. If the tasks don't depend on each other, they run concurrently.
2. Use your language's own concurrency tools. I don't know what your code looks like now, so I can't say more about this option.
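To make the second option a bit more concrete, here is a minimal sketch in Python, assuming the job is a Python notebook and that the per-table work is I/O-bound enough for threads to help. `ingest_all` and `fake_ingest` are hypothetical names, not part of any Databricks API; in a real notebook the stand-in would wrap something like `spark.read.format("jdbc")` plus a write to the target table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_all(tables, ingest_fn, max_workers=6):
    """Run ingest_fn(table) for every table, at most max_workers at a time.

    Returns (results, errors): two dicts keyed by table name.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one unit of work per table and remember which future is which.
        futures = {pool.submit(ingest_fn, table): table for table in tables}
        for future in as_completed(futures):
            table = futures[future]
            try:
                results[table] = future.result()
            except Exception as exc:
                # Record the failure and keep going with the other tables.
                errors[table] = exc
    return results, errors


def fake_ingest(table):
    # Hypothetical stand-in for the real JDBC read and write of one table.
    if table == "bad_table":
        raise ValueError(f"cannot read {table}")
    return f"{table}: ok"


if __name__ == "__main__":
    done, failed = ingest_all(["orders", "customers", "bad_table"], fake_ingest)
    print("loaded:", sorted(done))
    print("failed:", sorted(failed))
```

The list of table names would come from the metadata table, and `max_workers` caps how many JDBC connections are open at once, so it is worth keeping it within what the source databases can handle.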

If it's easy for you to describe the process as a DAG (directed acyclic graph), I'd say that using Databricks tasks is pretty straightforward. You could also try out https://docs.databricks.com/en/jobs/for-each.html, but I'm not sure how the concurrency works with that one.
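For the task-based route, one way to avoid clicking tasks together by hand is to generate the job definition from the table list. The sketch below only builds a settings dictionary shaped like a Jobs API payload (`tasks`, `task_key`, `notebook_task`, `base_parameters`); it does not call Databricks. The function name, the notebook path, and the `table_name` parameter are assumptions for illustration, and compute settings are left out entirely.

```python
import json
import re


def build_job_settings(tables, notebook_path, job_name="metadata_ingest"):
    """Build a job settings dict with one independent task per table."""
    tasks = []
    for table in tables:
        # Task keys are restricted to letters, digits, "_" and "-",
        # so replace anything else (for example the dot in "schema.table").
        safe_name = re.sub(r"[^A-Za-z0-9_-]", "_", table)
        tasks.append({
            "task_key": f"ingest_{safe_name}",
            "notebook_task": {
                "notebook_path": notebook_path,  # hypothetical notebook
                "base_parameters": {"table_name": table},
            },
            # No "depends_on" entry: the tasks are independent of each other.
        })
    return {"name": job_name, "max_concurrent_runs": 1, "tasks": tasks}


if __name__ == "__main__":
    settings = build_job_settings(
        ["sales.orders", "customers"], "/Workspace/ingest/load_table"
    )
    print(json.dumps(settings, indent=2))
```

Because no task lists a dependency, a single run of this job is free to work on several tables at once, which is why `max_concurrent_runs` can stay at 1 here.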


AngadSingh · ‎10-26-2024

Hi,

It seems the run is getting queued. It might be due to the following settings (except the 3rd):

Andolina · ‎10-29-2024

Hi Angad,

No, the runs are not getting queued. As this property is set at the job level, I was expecting the runs to execute concurrently or get queued, but we always see only 1 run of the workflow, even with concurrent runs set to 6.

Edthehead · ‎10-29-2024

The Maximum concurrent runs parameter allows multiple runs of the same workflow to execute in parallel. Since you've switched the queue parameter on, any runs beyond 6 will be queued. This only matters if the same workflow is triggered multiple times.
We can help you better if you provide more details on your workflow setup and how it is triggered. Is it 1 workflow or multiple workflows?
You've mentioned that only 1 workflow is running. And you've also mentioned there is a scheduled job for each table. Is it the same job/workflow for all tables or different ones for each? Since you have scheduled your job at a certain time, how is it getting triggered multiple times?
If you've scheduled multiple jobs all using the same notebook and different parameters, the Maximum concurrent runs parameter will not help you.
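As a rough illustration of the point above, here is a toy model in Python of how triggers of one job relate to the Maximum concurrent runs limit. This is not Databricks code: `admit_runs` is a made-up function, and the "skipped" behaviour with queueing off is my understanding of the documented default rather than something confirmed in this thread.

```python
def admit_runs(triggered, max_concurrent_runs, queue_enabled):
    """Split simultaneous triggers of ONE job into running, queued and skipped."""
    running = triggered[:max_concurrent_runs]
    overflow = triggered[max_concurrent_runs:]
    if queue_enabled:
        # Extra runs wait until a slot frees up.
        return {"running": running, "queued": overflow, "skipped": []}
    # Without queueing, runs over the limit do not start at all.
    return {"running": running, "queued": [], "skipped": overflow}


if __name__ == "__main__":
    # One scheduled trigger: the limit of 6 is never reached, so 1 run is shown.
    print(admit_runs(["scheduled"], 6, queue_enabled=True))
    # Eight triggers of the same job at once: 6 run and 2 wait in the queue.
    print(admit_runs([f"trigger_{i}" for i in range(8)], 6, queue_enabled=True))
```

The first call matches what the original question describes: a single schedule produces a single run, whatever the limit is set to.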
