08-09-2024 09:28 AM
I have a job with > 10 tasks in it that interacts with an external system outside of Databricks. At the moment that external system cannot handle more than 3 of the tasks executing concurrently. How can I limit the number of tasks that concurrently execute in a job? I'm not particularly worried about the order in which they execute, only that the number running at any one time is limited to 3.
The cluster that I execute this on currently has only 1 worker in it and I'm looking to limit what takes place on that single worker.
12-06-2024 10:36 PM
To limit the number of tasks that concurrently execute in a job to 3, you can use the max_concurrent_runs parameter in your job configuration. This parameter allows you to specify the maximum number of concurrent runs for a job, ensuring that no more than the specified number of tasks run at the same time.
When creating or updating your job, set the max_concurrent_runs parameter to 3. This will limit the number of concurrent tasks to 3.
The max_concurrent_runs parameter will handle the concurrency limit regardless of the cluster size.
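For reference, a minimal sketch of where the max_concurrent_runs field is set when creating a job, assuming the Databricks SDK for Python; the job name, notebook path, and task are placeholders, and cluster settings are omitted:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import Task, NotebookTask

w = WorkspaceClient()

# max_concurrent_runs caps how many runs of this job may execute at once.
w.jobs.create(
    name="external-system-job",          # placeholder name
    max_concurrent_runs=3,
    tasks=[
        Task(
            task_key="example_task",     # placeholder task
            notebook_task=NotebookTask(notebook_path="/Jobs/example"),
        ),
    ],
)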
12-07-2024 12:32 AM
Hi @tgburrin-afs, @Mounika_Tarigop,
As I understand it, the question is about running concurrent tasks within a single job rather than running concurrent jobs.
max_concurrent_runs controls how many times a whole job can run simultaneously, not the concurrency of tasks within a single job run.
There is currently no direct feature in Databricks Jobs to specify a maximum number of concurrently running tasks within a single job run. Instead, you need to control concurrency through task dependencies or application logic.
Approaches to Limit Concurrent Tasks Within a Single Job Run
Implement Concurrency Control in Your Code:
If each Databricks task itself runs code that can execute operations against the external system, implement concurrency control in your code logic.
For example, if a single task processes multiple items and you don't want more than three operations to hit the external system concurrently, use a thread pool within the task's code:
from concurrent.futures import ThreadPoolExecutor

# Limit to 3 concurrent operations against the external system.
# process_item and items_to_process are placeholders for your own
# per-item logic and work list.
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(process_item, item) for item in items_to_process]
    results = [f.result() for f in futures]

This approach requires merging multiple pieces of logic into a single task and controlling concurrency at the code level.
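The other route mentioned above, task dependencies, can also cap parallelism without touching the task code: arrange the otherwise independent tasks into three chains so that the scheduler never has more than three of them eligible to run at once. A minimal sketch, assuming the Databricks SDK for Python; the notebook paths, task keys, and job name are hypothetical, and cluster settings are omitted:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import Task, NotebookTask, TaskDependency

w = WorkspaceClient()

# Hypothetical notebooks, one per task. Splitting them round-robin into
# 3 chains means at most 3 tasks are runnable at any moment.
notebook_paths = [f"/Jobs/external_sync_{i}" for i in range(10)]

tasks = []
for i, path in enumerate(notebook_paths):
    prev = i - 3  # previous task in the same chain, if any
    tasks.append(
        Task(
            task_key=f"task_{i}",
            notebook_task=NotebookTask(notebook_path=path),
            depends_on=[TaskDependency(task_key=f"task_{prev}")] if prev >= 0 else None,
        )
    )

w.jobs.create(name="rate-limited-external-job", tasks=tasks)

The trade-off is that a slow task blocks the rest of its chain even when a slot in another chain has already freed up.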
04-19-2025 12:39 AM
Same issue here: job-level concurrency is covered, but there is nothing for tasks. Some of our jobs have countless parallel tasks, so without controlling it the downstream servers grind to a halt and tasks just terminate.
What is needed is a spinlock on tasks to control concurrency in a globally shared space. If your case allows for a Python wrapper, you can do something like this:
1. Create a volume;
2. Ensure each task has a unique key of sorts; use a parameter (see the sketch at the end of this post);
3. Do a spinlock on the volume with something like this:
import time
import random

#############
# Concurrency
#############
aat_vol = "/Volumes/{your_volume_path}"   # path to the volume used as the shared lock space
aat = 1                                   # maximum number of concurrent tasks allowed

def saat(s):
    # Start a task with key s: spin until fewer than aat entries exist
    # in the volume, then claim a slot by creating a directory named s.
    time.sleep(1 + random.randrange(1, 10))
    ready = False
    while not ready:
        entries = dbutils.fs.ls(aat_vol)
        if len(entries) < aat:
            dbutils.fs.mkdirs(aat_vol + "/" + s)
            ready = True
        else:
            time.sleep(1 + random.randrange(1, 10))

def eaat(s):
    # End a task with key s: release its slot by removing the directory.
    dbutils.fs.rm(aat_vol + "/" + s)

...
try:
    eaat(target)   # clear any stale entry left by a previous failed run
    saat(target)   # wait for, then claim, a slot
    ...
except Exception as e:
    eaat(target)   # release the slot on failure
    raise e
eaat(target)       # release the slot on success
aat_vol: the path to your volume.
aat: the maximum number of concurrent tasks you want to allow.
saat: starts a task with the given key.
eaat: ends a task with the given key.
The random sleep is there so tasks do not all check the volume at the same time; in some cases (roughly a 1-in-10 chance) more than one task will still pass the initial check, so this is not a strict guarantee. Always ensure you call eaat to clear the volume entry even if the task fails.
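A minimal sketch of how each task might obtain its unique key (step 2 above), assuming the key is passed as a task parameter named task_key; the parameter name and the way the value is supplied are assumptions:

# Hypothetical: the job definition passes each task a unique "task_key"
# parameter; dbutils.widgets reads it inside the notebook.
dbutils.widgets.text("task_key", "")
target = dbutils.widgets.get("task_key")

With target set this way, the try/except wrapper above takes care of claiming and releasing the slot.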
07-22-2025 01:29 AM
I have a similar question, but I am more interested in understanding whether resources are allocated in an intelligent way. I assume the compute configuration has an effect on how many tasks can run at the same time, but how exactly does it affect it?
My question is the following: suppose tasks A and B depend on tasks C and D. Other than this, there are 50 other tasks that have no dependencies. Suppose that in the beginning tasks C and D are started, plus, if there is a limit on parallelism, 8 more tasks. Then task C finishes, but D is still running, so A and B cannot start yet. Is the orchestration intelligent enough to start a new (9th) task from the 50 while waiting for D to finish? Or does it wait and waste resources in the meantime?