Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Workflow concurrent runs not working as expected

Andolina
New Contributor II

Hello All,

I am fetching data from different sources for tables driven by a metadata table. For each table listed in the metadata table, the data is pulled from its source through a JDBC connector, and a scheduled job is responsible for the fetch. With a large number of new tables being added, I want a faster, more effective way of ingesting the data using parallel processing. I tried the Maximum concurrent runs setting in the workflow and expected 6 parallel runs when I set concurrent runs = 6, but it shows only one run. Does this happen at the executor level? What is the expected behaviour of the Maximum concurrent runs option?

1 ACCEPTED SOLUTION

Accepted Solutions

elguitar
New Contributor III

So... you use a loop to go through the metadata table and then retrieve and ingest the data using JDBC?

If so, then concurrent runs won't help. Maximum concurrent runs is the number of runs of that job that can execute side by side. In your case, triggering the job 6 times would probably just mean ingesting the same data 6 times.

If you want to retrieve and ingest those tables concurrently, you can either:

  1. Split the individual table processing into separate tasks of the job. Tasks that don't depend on each other run concurrently.
  2. Use language-specific concurrency methods. I don't know what your code looks like now, so I can't say more about this option.

If it's easy for you to describe the process as a DAG (directed acyclic graph), I'd say that using Databricks tasks is pretty straightforward. You could also try out https://docs.databricks.com/en/jobs/for-each.html, but I'm not sure how the concurrency works with that one.
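Option 2 above could look something like the following minimal sketch, using Python's `concurrent.futures` to process several tables side by side. The table names and the `ingest_table` helper are illustrative placeholders, not from the original post; in a real notebook the helper body would be the actual Spark JDBC read/write.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical metadata: in the real job this list would come from the metadata table.
TABLES = ["customers", "orders", "invoices", "shipments"]

def ingest_table(table: str) -> str:
    # Placeholder for the real JDBC read/write, e.g.:
    # spark.read.format("jdbc").option("dbtable", table)...load()
    return f"ingested {table}"

def ingest_all(tables, max_workers=6):
    # Submit one ingestion task per table; up to max_workers run concurrently.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ingest_table, t): t for t in tables}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```

Threads work here because Spark JDBC reads are mostly I/O-bound from the driver's perspective; each thread just launches its own Spark job.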


4 REPLIES

AngadSingh
New Contributor III

Hi,

It seems the run is getting queued. It might be due to the following settings (all except the 3rd):

[Screenshot of the job settings: 1000037335.png]

Andolina
New Contributor II

Hi Angad,

No, the runs are not getting queued. Since this property is set at the job level, I expected runs either to execute concurrently or to be queued, but we always see only 1 run of the workflow, even with concurrent runs set to 6.


Edthehead
Contributor II

The Maximum concurrent runs parameter allows multiple runs of the same workflow to execute in parallel. Since you've switched the queue parameter on, anything beyond 6 will be queued. This only applies when the same workflow is triggered multiple times.
We can help you better if you provide more details on your workflow setup and how it is triggered: is it 1 workflow or multiple workflows?
You've mentioned that only 1 workflow is running, and also that there is a scheduled job for each table. Is it the same job/workflow for all tables, or a different one for each? Since you have scheduled your job at a certain time, how is it getting triggered multiple times?
If you've scheduled multiple jobs that all use the same notebook with different parameters, the Maximum concurrent runs parameter will not help you.
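As a sketch of what "the same workflow triggered multiple times" could look like: one `run-now` request per table against the Jobs API 2.1 endpoint. The job ID, parameter name, and host below are made-up placeholders for illustration.

```python
# Hypothetical sketch: trigger the same Databricks job once per table via
# POST /api/2.1/jobs/run-now. With Maximum concurrent runs = 6 and queueing
# enabled, up to 6 of these runs execute in parallel and the rest queue.
def build_run_now_payload(job_id: int, table: str) -> dict:
    # "table" as a job parameter name is an assumption for illustration.
    return {"job_id": job_id, "job_parameters": {"table": table}}

payloads = [build_run_now_payload(123, t)
            for t in ("customers", "orders", "invoices")]

# Actually sending these would need a workspace host and auth token, e.g.:
# requests.post(f"{host}/api/2.1/jobs/run-now", headers=auth_headers, json=p)
```

With a single scheduled trigger, though, only one run starts per schedule, which would match the behaviour you're seeing.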

