Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Limiting concurrent tasks in a job

tgburrin-afs
New Contributor

I have a job with more than 10 tasks in it that interact with an external system outside of Databricks.  At the moment that external system cannot handle more than 3 of the tasks executing concurrently.  How can I limit the number of tasks that execute concurrently within a job?  I'm not particularly worried about the order in which they execute, only that the number running at any one time is limited to 3.

The cluster that I execute this on currently has only 1 worker in it and I'm looking to limit what takes place on that single worker.

3 REPLIES

Mounika_Tarigop
Databricks Employee

To limit the number of tasks that concurrently execute in a job to 3, you can use the max_concurrent_runs parameter in your job configuration. This parameter allows you to specify the maximum number of concurrent runs for a job, ensuring that no more than the specified number of tasks run at the same time.

When creating or updating your job, set the max_concurrent_runs parameter to 3. This will limit the number of concurrent tasks to 3.

The max_concurrent_runs parameter will handle the concurrency limit regardless of the cluster size.
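For reference, a minimal sketch of where this setting sits in the job settings JSON (the job name, task key, and notebook path here are placeholders, not taken from the original job):

{
  "name": "external-system-job",
  "max_concurrent_runs": 3,
  "tasks": [
    {
      "task_key": "task_a",
      "notebook_task": { "notebook_path": "/Workspace/jobs/task_a" }
    }
  ]
}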

filipniziol
Esteemed Contributor

Hi @tgburrin-afs, @Mounika_Tarigop,

As I understand it, the question is about running concurrent tasks within a single job rather than running concurrent jobs.

max_concurrent_runs controls how many times a whole job can run simultaneously, not the concurrency of tasks within a single job run.

There is currently no direct feature in Databricks Jobs to specify a maximum number of concurrently running tasks within a single job run. Instead, you need to control concurrency through task dependencies or application logic.

Approaches to Limit Concurrent Tasks Within a Single Job Run

  1. Use Task Dependencies to Limit Parallelism:
    Structure your job so that no more than three tasks run at the same "layer." For example:
    • Suppose you have 12 tasks total. Instead of having all 12 start at once, arrange them in four "waves" of three tasks each.
    • In the Job UI or JSON configuration (a JSON sketch appears after this list):
      • Start with three tasks (A, B, C) that have no upstream dependencies. They run simultaneously.
      • The next set of three tasks (D, E, F) only start after A, B, and C all complete.
      • Repeat this pattern until all tasks have run. This ensures that at most three tasks are active at the same time.
  2. Implement Concurrency Control in Your Code:

    • If each Databricks task itself runs code that executes operations against the external system, implement concurrency control in your application logic.

    • For example, if a single task processes multiple items and you don’t want more than three operations to hit the external system concurrently, use a thread pool within the task’s code:

 

from concurrent.futures import ThreadPoolExecutor

# Limit to 3 concurrent operations against the external system.
# process_item and items_to_process are placeholders for your own work function and work list.
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(process_item, item) for item in items_to_process]
    results = [f.result() for f in futures]

 

  • This approach requires merging multiple pieces of logic into a single task and controlling concurrency at the code level.
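For approach 1, a minimal sketch of the "waves" wiring in the tasks section of the job JSON, as referenced above (task keys and notebook paths are placeholders). The first wave (A, B, C) has no dependencies; each task in the second wave depends on all three:

"tasks": [
  { "task_key": "A", "notebook_task": { "notebook_path": "/Workspace/jobs/a" } },
  { "task_key": "B", "notebook_task": { "notebook_path": "/Workspace/jobs/b" } },
  { "task_key": "C", "notebook_task": { "notebook_path": "/Workspace/jobs/c" } },
  {
    "task_key": "D",
    "depends_on": [ { "task_key": "A" }, { "task_key": "B" }, { "task_key": "C" } ],
    "notebook_task": { "notebook_path": "/Workspace/jobs/d" }
  }
]

Tasks E and F would carry the same depends_on list as D, and the later waves repeat the pattern against the previous wave.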

 

 

_J
New Contributor II

Same issue here: job-level concurrency is covered, but there is nothing for tasks. Some of our jobs have countless parallel tasks, so without controlling it the downstream servers grind to a halt and tasks just terminate.

What is needed is what we call a spinlock on tasks to control the concurrency in a globally shared space. If your case allows for a Python wrapper, we can do something like this:

1. Create a volume;

2. Ensure each task has a unique key of sorts; use a parameter (see the sketch at the end of this post);

3. Do a spinlock on the volume with something like this:

import time
import random

#############
# Concurrency
#############
aat_vol = "/Volumes/{your_volume_path}"  # path to the shared volume used as the lock space
aat = 1                                  # number of concurrent tasks to allow

def saat(s):
    """Start a task: spin until fewer than aat markers exist, then claim a slot for key s."""
    time.sleep(1 + random.randrange(1, 10))  # random delay so tasks don't all check at once
    ready = False
    while not ready:
        entries = dbutils.fs.ls(aat_vol)  # current lock markers in the volume
        if len(entries) < aat:
            dbutils.fs.mkdirs(aat_vol + "/" + s)  # claim a slot by creating a marker directory
            ready = True
        else:
            time.sleep(1 + random.randrange(1, 10))

def eaat(s):
    """End a task: release the slot by removing its marker."""
    dbutils.fs.rm(aat_vol + "/" + s, True)

...
try:
    eaat(target)   # clear any stale marker left by a previous failed run
    saat(target)   # wait for a free slot
    ...            # the task's actual work against the external system
except Exception as e:
    eaat(target)   # release the slot on failure
    raise e
eaat(target)       # release the slot on success

aat_vol: the path to your volume
aat: the number of concurrent tasks you want to allow
saat: starts a task with the given key
eaat: ends a task with the given key

The random delay is used to ensure the tasks do not spin up at the same time; in some cases (roughly a 1 in 10 probability) you will still have more than one task starting its initial spin simultaneously. Always ensure you call eaat to clear the volume even if the task fails.
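On point 2 above, one possible way to give each task its unique key is a task parameter; a minimal sketch, assuming each notebook task defines a base parameter named task_key (the parameter name is just an example):

# "task_key" is an example parameter name; each task passes a different value
target = dbutils.widgets.get("task_key")

The value read here is the target used in the try/except block above.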