Hi @rtglorenabasul,
Thanks for sharing the details. The behaviour you're seeing is consistent with an issue in how the job is bringing up Serverless GPU compute, rather than with the notebook code itself.
From what I've checked, that error usually means the underlying Serverless GPU session failed during startup and the Jobs service couldn't retrieve a more detailed reason from the compute layer. That's why the same notebook can run fine interactively on Serverless GPU A10 while the scheduled job fails almost immediately.
To help narrow this down, could you try the following:
Check the job's compute settings: in the Jobs UI, go to your job --> task --> compute and confirm that compute is explicitly set to Serverless GPU (A10), not just "Serverless". Also make sure the Environment (e.g., Standard v4 or AI v4) matches the one you use when the notebook runs successfully in interactive mode.
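If you'd rather verify the task's compute reference programmatically, here's a minimal sketch against the Jobs API 2.1 `GET /api/2.1/jobs/get` endpoint. The workspace URL, token, and job ID are placeholders, and the exact environment fields for GPU serverless may differ slightly in your workspace, so treat this as a starting point rather than a definitive check:

```python
import json
import urllib.request


def describe_task_compute(task: dict) -> str:
    """Classify a Jobs API 2.1 task dict by its compute reference.

    A task with no cluster reference runs on serverless compute;
    environment_key (if present) points at the job-level environment
    spec (e.g. Standard v4 / AI v4).
    """
    if "existing_cluster_id" in task:
        return f"classic cluster {task['existing_cluster_id']}"
    if "new_cluster" in task:
        return "job cluster (classic compute)"
    env = task.get("environment_key", "<default>")
    return f"serverless (environment_key={env})"


def get_job_settings(host: str, token: str, job_id: int) -> dict:
    """Fetch the job definition from GET /api/2.1/jobs/get."""
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/get?job_id={job_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage against a real workspace (placeholders, do not run as-is):
# settings = get_job_settings("https://<workspace-url>", "<token>", job_id=123)
# for task in settings["settings"]["tasks"]:
#     print(task["task_key"], "->", describe_task_compute(task))

# Demo on a sample task dict shaped like the API response:
sample_task = {"task_key": "train", "environment_key": "gpu_a10_env"}
print(describe_task_compute(sample_task))  # serverless (environment_key=gpu_a10_env)
```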
You can also try a minimal "hello world" GPU job: create a new notebook, attach it to Serverless GPU A10, and run the following:
```python
import torch

# Confirms the session actually came up with GPU compute attached
print("Hello from GPU:", torch.cuda.is_available())
```
From that notebook, use "Run as job" / "Schedule" to create a job targeting Serverless GPU A10, then run it once.
If even the minimal "hello world" job fails in the same way, that strongly suggests a platform or configuration issue with Serverless GPU + Jobs in your workspace/region, not your specific code. At that point the next step is to open a Support ticket so the Databricks team can look at backend logs.
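Before (or instead of) opening the ticket, you can also pull the failed run's terminal state via the Jobs API 2.1 `GET /api/2.1/jobs/runs/get` endpoint; even an empty `state_message` is worth quoting in the ticket, since it confirms the compute layer returned nothing. A sketch, with the workspace URL, token, and run ID as placeholders:

```python
import json
import urllib.request


def run_failure_summary(run: dict) -> str:
    """Summarize a run's terminal state from a /jobs/runs/get response.

    For the failure described here, state_message may come back empty,
    which is itself useful evidence to include in a support ticket.
    """
    state = run.get("state", {})
    result = state.get("result_state", "UNKNOWN")
    message = state.get("state_message", "") or "<no message from compute layer>"
    return f"{result}: {message}"


def get_run(host: str, token: str, run_id: int) -> dict:
    """Fetch run details from GET /api/2.1/jobs/runs/get."""
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/runs/get?run_id={run_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage against a real workspace (placeholders, do not run as-is):
# run = get_run("https://<workspace-url>", "<token>", run_id=456)
# print(run_failure_summary(run))

# Demo on a sample response shaped like the API output:
sample_run = {"state": {"result_state": "FAILED", "state_message": ""}}
print(run_failure_summary(sample_run))  # FAILED: <no message from compute layer>
```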
When you contact Support, please include:
- Workspace URL and workspace ID
- Cloud and region
- Job ID and failing Job Run ID
- Approximate time of the failed run
- A note that the notebook runs successfully on an interactive Serverless GPU A10 session and the scheduled job on Serverless GPU A10 fails at startup with Reason: UNKNOWN (SUCCESS).
That information will help Support route this to the right team and treat it as a potential incident with Serverless GPU jobs, rather than a generic notebook error.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***