
Programmatically set minimum workers for a job cluster based on file size?

Alena
New Contributor II

I’m running an ingestion pipeline with a Databricks job:

  1. A file lands in S3

  2. A Lambda is triggered

  3. The Lambda runs a Databricks job

The incoming files vary a lot in size, which makes processing times vary as well. My job cluster has autoscaling enabled, but scaling up takes time.

Ideally, if a 10 GB file comes in, I’d like the job to start with more than one worker immediately, instead of waiting for autoscaling to kick in.

I’m currently using the run-now API to trigger the job, but I don’t see a way to adjust the job cluster configuration at runtime.
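
For context, my current trigger looks roughly like this minimal sketch (the env vars and the `s3_key` job parameter are placeholders, and it assumes the job defines job-level parameters):

```python
# Minimal sketch of the current trigger: an AWS Lambda handler that calls
# the Databricks run-now endpoint when a file lands in S3.
import json
import os
import urllib.request

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]  # e.g. https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]
JOB_ID = int(os.environ["JOB_ID"])


def lambda_handler(event, context):
    # S3 event notifications carry the object key and size.
    obj = event["Records"][0]["s3"]["object"]
    payload = {
        "job_id": JOB_ID,
        # Pass the file location to the job; assumes a job-level
        # parameter named "s3_key" is defined on the job.
        "job_parameters": {"s3_key": obj["key"]},
    }
    req = urllib.request.Request(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {DATABRICKS_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # contains the run_id
```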

Is there a way to programmatically set the minimum number of workers for a job cluster depending on the incoming file size?

 

1 REPLY

kerem
Contributor

Hi Alena, 

The Jobs API has an update endpoint that lets you do exactly that: https://docs.databricks.com/api/workspace/jobs_21/update. Your Lambda can call it to raise the job cluster's minimum workers before triggering the run.
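
For example, the Lambda could bump the autoscale minimum based on the incoming file size and then trigger the run. This is an untested sketch; the "main" job cluster key, the 10 GB threshold, the worker counts, and the cluster spec are all placeholders for your own values:

```python
# Sketch: resize the job cluster's autoscale range for the incoming file,
# then trigger the run, using the Jobs 2.1 update and run-now endpoints.
import json
import os
import urllib.request

HOST = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]


def _post(path, payload):
    req = urllib.request.Request(
        f"{HOST}{path}",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_sized(job_id: int, file_size_bytes: int):
    # Pick a starting worker count from the file size (threshold is made up).
    min_workers = 8 if file_size_bytes > 10 * 1024**3 else 1

    # update replaces any top-level field present in new_settings, so the
    # whole job_clusters entry (spark_version, node_type_id, ...) must be
    # restated, not just the autoscale block.
    _post("/api/2.1/jobs/update", {
        "job_id": job_id,
        "new_settings": {
            "job_clusters": [{
                "job_cluster_key": "main",
                "new_cluster": {
                    "spark_version": "15.4.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "autoscale": {"min_workers": min_workers,
                                  "max_workers": 16},
                },
            }],
        },
    })
    return _post("/api/2.1/jobs/run-now", {"job_id": job_id})
```

One thing to keep in mind with this approach: two files arriving close together will race on the shared job configuration, since both Lambdas update the same job before triggering it.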

If for some reason you can't update the job before triggering it, you can also consider creating a new job with the desired configuration on every trigger (POST /api/2.2/jobs/create), as in the sketch below.
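
A minimal sketch of that variant (again untested; the task, notebook path, and cluster spec are placeholders):

```python
# Sketch: create a throwaway job per file with the autoscale minimum
# already sized for it, then run it.
import json
import os
import urllib.request

HOST = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]


def _post(path, payload):
    req = urllib.request.Request(
        f"{HOST}{path}",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def create_and_run(s3_key: str, file_size_bytes: int):
    min_workers = 8 if file_size_bytes > 10 * 1024**3 else 1
    job = _post("/api/2.2/jobs/create", {
        "name": f"ingest-{s3_key}",
        "tasks": [{
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Workspace/ingest"},
            "job_cluster_key": "main",
        }],
        "job_clusters": [{
            "job_cluster_key": "main",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "autoscale": {"min_workers": min_workers,
                              "max_workers": 16},
            },
        }],
    })
    return _post("/api/2.2/jobs/run-now", {"job_id": job["job_id"]})
```

Since this creates a job per file, you'll probably also want to delete the one-off jobs afterwards (POST /api/2.2/jobs/delete) to avoid clutter.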

 

Kerem Durak
