Data Engineering

Programmatically set minimum workers for a job cluster based on file size?

Alena
New Contributor II

I'm running an ingestion pipeline with a Databricks job:

  1. A file lands in S3

  2. A Lambda is triggered

  3. The Lambda runs a Databricks job

The incoming files vary a lot in size, which makes processing times vary as well. My job cluster has autoscaling enabled, but scaling up takes time.

Ideally, if a 10 GB file comes in, I'd like the job to start with more than one worker immediately, instead of waiting for autoscaling to kick in.

I'm currently using the run-now API to trigger the job, but I don't see a way to adjust the job cluster configuration at runtime.
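
For reference, the Lambda currently does something like this (simplified sketch; the env var and parameter names are placeholders):

```python
# Simplified sketch of the current trigger; env var names are placeholders.
import json
import os
import urllib.request

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]
JOB_ID = int(os.environ["JOB_ID"])

def lambda_handler(event, context):
    # S3 put event -> trigger the Databricks job, passing the file path along.
    record = event["Records"][0]["s3"]
    s3_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    req = urllib.request.Request(
        f"{DATABRICKS_HOST}/api/2.2/jobs/run-now",
        data=json.dumps({
            "job_id": JOB_ID,
            "job_parameters": {"input_path": s3_path},
        }).encode(),
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```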

Is there a way to programmatically set the minimum number of workers for a job cluster depending on the incoming file size?

 

1 REPLY

kerem
Contributor

Hi Alena, 

The Jobs API has an update endpoint that can do exactly that: https://docs.databricks.com/api/workspace/jobs_21/update
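
A minimal sketch of how that could look from your Lambda, assuming the job uses a job cluster with `job_cluster_key` "main" (the cluster spec, sizing thresholds, and names below are illustrative, not from your setup):

```python
# Sketch: raise min_workers via jobs/update before triggering, sized by file size.
# Assumes the job defines a job cluster with job_cluster_key "main"; the
# spark_version, node_type_id, and thresholds below are placeholders.
import json
import urllib.request

def _post(host, token, endpoint, body):
    # Small helper for authenticated Jobs API POST calls.
    req = urllib.request.Request(
        f"{host}/api/2.2/jobs/{endpoint}",
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read() or b"{}")

def min_workers_for(file_size_bytes):
    # Crude sizing heuristic -- tune to your workload.
    gb = file_size_bytes / 1024 ** 3
    return 8 if gb >= 10 else 4 if gb >= 2 else 1

def trigger(host, token, job_id, s3_path, file_size_bytes):
    # Top-level fields in new_settings replace the existing ones wholesale,
    # so send the full cluster spec, not just the autoscale block.
    _post(host, token, "update", {
        "job_id": job_id,
        "new_settings": {
            "job_clusters": [{
                "job_cluster_key": "main",
                "new_cluster": {
                    "spark_version": "15.4.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "autoscale": {
                        "min_workers": min_workers_for(file_size_bytes),
                        "max_workers": 16,
                    },
                },
            }],
        },
    })
    return _post(host, token, "run-now", {
        "job_id": job_id,
        "job_parameters": {"input_path": s3_path},
    })
```

One caveat: update mutates the shared job definition, so if two files land close together, the second update can race the first run. The create-a-job-per-run approach below avoids that.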

If for some reason you can't update the job before triggering it, you can also consider creating a new job with the desired configuration every time a trigger fires (POST /api/2.2/jobs/create).
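
Roughly like this, reusing the `_post` helper and sizing heuristic from the sketch above (the notebook path and cluster spec are again placeholders):

```python
# Sketch: create a single-use job sized for this file, then run it.
# Reuses _post() and min_workers_for() from the previous sketch; the
# notebook path and cluster spec are placeholders.
def create_and_run(host, token, s3_path, file_size_bytes):
    created = _post(host, token, "create", {
        "name": f"ingest {s3_path}",
        "tasks": [{
            "task_key": "ingest",
            "notebook_task": {
                "notebook_path": "/Workspace/pipelines/ingest",
                "base_parameters": {"input_path": s3_path},
            },
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "autoscale": {
                    "min_workers": min_workers_for(file_size_bytes),
                    "max_workers": 16,
                },
            },
        }],
    })
    return _post(host, token, "run-now", {"job_id": created["job_id"]})
```

Note that jobs created this way persist in the workspace until you delete them (POST /api/2.2/jobs/delete), so you'll probably want to clean them up once the run finishes.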

 

Kerem Durak