I'm running an ingestion pipeline with a Databricks job:
1. A file lands in S3.
2. A Lambda is triggered.
3. The Lambda runs a Databricks job (handler sketched below).
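
For reference, the Lambda handler is shaped roughly like this (a simplified sketch; `trigger_databricks_job` is a hypothetical helper, shown further down):

```python
def lambda_handler(event, context):
    # S3 event notifications already carry the object's size in bytes,
    # so the handler knows how big the incoming file is up front.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    file_size = record["object"]["size"]
    trigger_databricks_job(bucket, key, file_size)  # hypothetical helper, see below
```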
The incoming files vary a lot in size, which makes processing times vary as well. My job cluster has autoscaling enabled, but scaling up takes time.
Ideally, if a 10 GB file comes in, I'd like the job to start with more than one worker immediately, instead of waiting for autoscaling to kick in.
I'm currently using the `run-now` API to trigger the job, but I don't see a way to adjust the job cluster configuration at runtime.
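
Concretely, the trigger looks roughly like this (simplified; the host, job ID, token handling, and parameter names are placeholders):

```python
import json
import urllib.request

DATABRICKS_HOST = "https://<workspace>.cloud.databricks.com"  # placeholder
JOB_ID = 123  # placeholder
TOKEN = "..."  # loaded from a secret in practice

def trigger_databricks_job(bucket: str, key: str, file_size: int) -> None:
    # run-now takes the job ID plus run parameters; file_size is what I'd
    # like to use to size the cluster, but I don't see any field here for
    # overriding the job cluster's min/max worker counts.
    payload = json.dumps({
        "job_id": JOB_ID,
        "notebook_params": {"input_path": f"s3://{bucket}/{key}"},
    }).encode()
    req = urllib.request.Request(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        data=payload,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
```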
Is there a way to programmatically set the minimum number of workers for a job cluster depending on the incoming file size?