Administration & Architecture

Error running 80 tasks at the same time in a Job: how can I limit this?

Maxi1693
New Contributor II

Hi! I have a Job running to process multiple streaming tables. 

In the beginning it worked fine, but now the job has 80 tables, and all of the tasks try to run at the same time, which throws an error. Is there a way to limit the number of tasks that a job can execute in parallel in each run?

I have configured the Job cluster with autoscaling from 2 to 4 workers. I thought this would limit the job to running 4 tasks at a time, but I was wrong.

The error I am getting is "Failure starting repl. Try detaching and re-attaching the notebook." From what I could find, this happens because the cluster is overloaded, but I cannot limit the number of tasks running in parallel.

1 REPLY

Kaniz
Community Manager

Hi @Maxi1693, It appears that you’re encountering issues with parallel execution of tasks in your Databricks job.

Let’s address this by considering a few strategies:

  1. Concurrency Limit for Tasks:

  2. Job Creation Rate:

  3. Autoscaling and Cluster Configuration:

    • While autoscaling helps manage cluster resources dynamically, it doesn’t inherently limit the number of concurrent runs.
    • Instead of relying solely on autoscaling, consider setting a fixed cluster size (e.g., 4 workers) to ensure consistent resources for your tasks.
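    • Additionally, limit how many tasks run at the same time. To my knowledge there is no “Max Concurrency” or “Max Tasks” option in the cluster configuration that does this; the limit has to come from the job definition, for example by chaining tasks with dependencies or by looping over the tables with a For each task that has a concurrency value. Adjust the limit based on your workload and available resources.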
  4. Notebook Execution Order:

    • If your streaming tables are processed within notebooks, ensure that they execute sequentially rather than concurrently.
    • You can use notebook workflows or job dependencies to control the order of execution.
  5. Cluster Overload and REPL Errors:

    • The “Failure starting repl” error often occurs when the cluster is overloaded due to excessive concurrent tasks.
    • Detaching and re-attaching the notebook might temporarily alleviate the issue, but it’s not a long-term solution.
    • Focus on managing concurrency and resource allocation to prevent cluster overload.
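
As a rough illustration of the job-dependency approach from point 4, the sketch below generates task definitions in the shape used by the Jobs API 2.1 (`task_key`, `depends_on`, `notebook_task`). It spreads the tables over a fixed number of "lanes" and makes each task depend on the previous task in its lane, so no more than `max_parallel` tasks are runnable at once. The function name, the notebook path, the `table` parameter, and the `process_<table>` key pattern are all assumptions made for this example, and it assumes each task finishes on its own (a triggered run rather than a continuous stream).

```python
from typing import Dict, List


def build_lane_tasks(tables: List[str], max_parallel: int, notebook_path: str) -> List[Dict]:
    """Return task definitions where at most `max_parallel` tasks can run at once.

    Hypothetical helper: table names are assumed to be valid task-key fragments
    (letters, digits, underscores, dashes).
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")

    tasks: List[Dict] = []
    last_in_lane: Dict[int, str] = {}  # lane index -> task_key of the previous task in that lane

    for i, table in enumerate(tables):
        lane = i % max_parallel  # round-robin assignment of tables to lanes
        task_key = f"process_{table}"
        task: Dict = {
            "task_key": task_key,
            "notebook_task": {
                "notebook_path": notebook_path,        # assumed path
                "base_parameters": {"table": table},   # assumed parameter name
            },
        }
        # Every task except the first in a lane waits for its predecessor.
        if lane in last_in_lane:
            task["depends_on"] = [{"task_key": last_in_lane[lane]}]
        last_in_lane[lane] = task_key
        tasks.append(task)

    return tasks


tasks = build_lane_tasks([f"t{i}" for i in range(80)], 4, "/Workspace/etl/process_table")
```

The resulting list would go into the `tasks` field of the job settings. By default a dependent task only runs if its upstream task succeeded, so one failure stops the rest of that lane; setting `"run_if": "ALL_DONE"` on the dependent tasks is one way to keep a lane going, if that behaviour is acceptable for your pipeline.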

Remember to monitor your job execution patterns, adjust cluster settings, and stagger your tasks to maintain a balance between parallelism and resource availability.
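
If you prefer to keep a single task and stagger the work inside a driver notebook, a bounded thread pool is one possible sketch. The helper below is not a Databricks API; `run_with_limit` and its parameters are names invented for this example. It takes any callable that processes one table, so in a notebook you could pass a wrapper around `dbutils.notebook.run`, and it never has more than `max_parallel` calls in flight.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_with_limit(
    tables: List[str],
    run_one: Callable[[str], str],
    max_parallel: int = 4,
) -> Dict[str, str]:
    """Call `run_one(table)` for every table with at most `max_parallel` running at once."""
    results: Dict[str, str] = {}
    # The pool size is the concurrency cap: extra submissions wait in the queue.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {table: pool.submit(run_one, table) for table in tables}
        for table, future in futures.items():
            try:
                results[table] = future.result()
            except Exception as exc:
                # Record the failure and keep going with the other tables.
                results[table] = f"FAILED: {exc}"
    return results


# In a Databricks notebook, the callable could look like this (path and
# timeout are placeholders):
# run_one = lambda table: dbutils.notebook.run("./process_table", 3600, {"table": table})
# results = run_with_limit(table_names, run_one, max_parallel=4)
```

A pool size close to what the driver can comfortably handle is a reasonable starting point; the right value depends on your workload, so treat 4 as a placeholder.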
