05-13-2025 11:53 PM
Hi, I've been running a job on Azure Databricks serverless that does some batch data processing every 4 hours. This job, deployed with asset bundles, had been running fine for weeks, and then yesterday it suddenly started failing with an error that doesn't point to any real line in the source files. One run points to one file, the next run points to another.
Running on job compute doesn't generate this error.
Any idea what could cause this? Many jobs are failing in one workspace, whereas the same jobs, with the same config, run fine in another workspace.
05-14-2025 09:03 AM
Hey @thibault ,
One possibility is an update to the underlying Databricks runtime for serverless compute, which could have affected a dependency and is now causing different behavior.
It's hard to say without knowing what the data and code look like, but it would also be good to double-check that there isn't a data issue that could have caused this.
My recommendation for now would be to run it on job compute, since that gives you a fixed runtime, and to debug each task to get a better sense of which specific logic is causing the failure. If you have strict dependency requirements, job compute may be a better option for you.
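For reference, a fixed runtime can be pinned directly in the bundle's job definition by declaring a job cluster. This is only a minimal sketch; the job name, schedule, notebook path, node type, and runtime version are placeholders you'd replace with your own values:

```yaml
# Minimal sketch of a bundle job running on classic job compute with a pinned runtime.
# Job name, schedule, paths, node_type_id, and spark_version are illustrative placeholders.
resources:
  jobs:
    batch_processing_job:
      name: batch_processing_job
      schedule:
        quartz_cron_expression: "0 0 0/4 * * ?"   # every 4 hours
        timezone_id: UTC
      job_clusters:
        - job_cluster_key: fixed_runtime
          new_cluster:
            spark_version: "15.4.x-scala2.12"     # pinned Databricks Runtime version
            node_type_id: Standard_DS3_v2
            num_workers: 2
      tasks:
        - task_key: process
          job_cluster_key: fixed_runtime
          notebook_task:
            notebook_path: ../src/process_notebook
```

With this in place, the runtime only changes when you bump the version yourself, which makes behavior easier to reproduce across workspaces.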
05-15-2025 02:00 AM
Hi @Shua42 , I am switching back to job compute for now in prod.
These are the exact same jobs, reading the same data from UC, just in two different workspaces: the one in the dev workspace runs just fine, whereas the one in prod is failing. Also, the error seems inconsistent: one job complains about a nonexistent line of code, one that looks like a log timestamp, in an empty __init__.py file, and another job is failing due to what looks like a circular import. This all happened overnight, with the latest code changes dating back weeks.
I'll file a bug as this seems unrelated to our setup.
05-15-2025 08:56 AM
@Shua42 , I was able to reproduce the error by running a notebook from the bundle file structure.
The interesting thing is that if I clone the whole contents of the folder under .bundle and run the notebook from that new structure, it no longer fails.
Deleting the bundle and redeploying does not help, and renaming the clone re-triggers the error. Not sure if that helps, but I'll keep testing things out.
05-16-2025 05:37 AM
@Shua42 , strangely, all serverless tests started passing again today, so I redeployed all bundles as serverless jobs and everything is working again. Does this sound like it could be related to a bug Databricks found and fixed this week?
05-16-2025 07:04 AM
Hey @thibault ,
Glad to hear it is working again. I don't see any specific mention of a bug internally that would be related to this, but it is likely that it was due to a change in the underlying runtime for serverless compute.
This may be one of the tradeoffs to consider between serverless and standard jobs compute. Not having to manage the environment with serverless does reduce maintenance overhead, but it can lead to inconsistent dependency behavior in your code, since you don't have as much control over the environment.
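If you do stay on serverless, one way to regain some control is to pin your Python dependencies in the serverless environment spec of the job in the bundle. A minimal sketch, assuming hypothetical package names and versions as placeholders:

```yaml
# Minimal sketch: pinning dependencies for a serverless task via an environment spec.
# The environment_key, package names, and versions are illustrative placeholders.
resources:
  jobs:
    batch_processing_job:
      name: batch_processing_job
      tasks:
        - task_key: process
          environment_key: pinned_deps
          spark_python_task:
            python_file: ../src/process.py
      environments:
        - environment_key: pinned_deps
          spec:
            client: "1"
            dependencies:
              - pandas==2.2.2
              - my-internal-package==0.3.1
```

This doesn't freeze the serverless runtime itself, but it at least keeps your own library versions stable across runs.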