I'm running an ingestion pipeline with a Databricks job:
A file lands in S3
A Lambda is triggered
The Lambda runs a Databricks job
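For context, here's a simplified sketch of what the Lambda does today. The host, token, and job ID come from environment variables, and the notebook parameter names are just what my notebook happens to expect; treat it as an illustration of the run-now call, not the exact production code:

```python
import json
import os
import urllib.parse
import urllib.request

DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]  # token stored as an encrypted env var / secret
JOB_ID = int(os.environ["DATABRICKS_JOB_ID"])

def handler(event, context):
    # The S3 put event carries one record per object that landed
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Trigger the existing Databricks job, handing it the file location as notebook parameters
    payload = {
        "job_id": JOB_ID,
        "notebook_params": {"bucket": bucket, "key": key},
    }
    req = urllib.request.Request(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {DATABRICKS_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # response contains the run_id
```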
The incoming files vary a lot in size, which makes processing times vary as well. My job cluster has autoscaling enabled, but scaling up takes time.
Ideally, if a 10 GB file comes in, I'd like the job to start with more than one worker immediately, instead of waiting for autoscaling to kick in.
I'm currently using the run-now API to trigger the job, but I don't see a way to adjust the job cluster configuration at runtime.
Is there a way to programmatically set the minimum number of workers for a job cluster depending on the incoming file size?
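To make the question concrete, this is the kind of logic I'd like the Lambda to apply. The `min_workers` field in the payload below is invented purely to show the intent, because I can't find anything equivalent in the run-now API:

```python
def pick_min_workers(size_bytes: int) -> int:
    # Rough mapping from incoming file size to the worker count the job should start with
    if size_bytes < 1 * 1024**3:   # under 1 GB
        return 1
    if size_bytes < 5 * 1024**3:   # 1 to 5 GB
        return 4
    return 8                       # 10 GB-class files

def handler(event, context):
    record = event["Records"][0]
    size_bytes = record["s3"]["object"]["size"]  # the S3 event already includes the object size

    payload = {
        "job_id": JOB_ID,
        "notebook_params": {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        },
        # Hypothetical field: this is what I'd like to express, but run-now
        # doesn't seem to accept any cluster sizing overrides.
        "min_workers": pick_min_workers(size_bytes),
    }
    # ... POST payload to /api/2.1/jobs/run-now as in the snippet above
```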