How can I use cluster autoscaling with intensive subprocess calls?

KellenO
New Contributor II

I have a custom application/executable that I upload to DBFS and transfer to my cluster's local storage for execution. I want to call multiple instances of this application in parallel, which I've only been able to successfully do with Python's subprocess.Popen(). However, doing it this way doesn't take advantage of autoscaling.

As a quick code example of what I'm trying to do:

import subprocess

ListOfCustomArguments = ["/path/to/config1.txt", "/path/to/config2.txt"]  # Hundreds of custom configurations here

# Launch one instance of the executable per configuration, all on the driver.
processes = []
for arg in ListOfCustomArguments:
    command = "/path/to/executable " + arg
    processes.append(subprocess.Popen(command, shell=True))

# Wait for every instance to finish.
for p in processes:
    p.wait()

print("Done!")

As is, this will not auto-scale. Any ideas?

ACCEPTED SOLUTION

Anonymous
Not applicable

Autoscaling works for Spark jobs only: it monitors the Spark job queue, and plain Python code running on the driver never enters that queue. If the workload is just Python code, try a single-node cluster.

https://docs.databricks.com/clusters/configure.html#cluster-size-and-autoscaling
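For context, here is a minimal sketch (not part of the original answer) of what it would look like to push the same launches through the Spark scheduler, which is the queue the autoscaler actually watches. The paths are the placeholders from the question, and it assumes the executable and config files exist on every worker node (for example, copied there by an init script); neither assumption comes from this thread.

import subprocess
from pyspark.sql import SparkSession

# In a Databricks notebook, `spark` already exists; getOrCreate() simply reuses it.
sc = SparkSession.builder.getOrCreate().sparkContext

ListOfCustomArguments = ["/path/to/config1.txt", "/path/to/config2.txt"]  # hundreds of configs

def run_one(config_path):
    # Runs one instance of the executable on whichever worker picks up this task.
    result = subprocess.run(["/path/to/executable", config_path],
                            capture_output=True, text=True)
    return (config_path, result.returncode)

# One Spark task per configuration; pending tasks are what the autoscaler reacts to.
rdd = sc.parallelize(ListOfCustomArguments, numSlices=len(ListOfCustomArguments))
results = rdd.map(run_one).collect()
print("Done!", results)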


Nice response @Joseph Kambourakis
