
Databricks Serverless Job : sudden random failure

thibault
Contributor III

Hi, I've been running a job on Azure Databricks serverless that does some batch data processing every 4 hours. This job, deployed with bundles, had been running fine for weeks. Then yesterday, all of a sudden, it started failing with an error that doesn't point to any real line in the source files. In this run it points to the file in the screenshot below; another run points to a different file.

Running the same job on job compute doesn't produce this error.

[Image: screenshot of the error, thibault_0-1747205505731.png]

Any idea what causes this? It makes many jobs fail in one workspace, whereas the same jobs with the same config run fine in another workspace.

5 REPLIES

Shua42
Databricks Employee

Hey @thibault ,

One possibility is that this could be due to an update of the underlying Databricks runtime for serverless compute, which could have affected a dependency and is now causing differing behavior.

It's hard to say without knowing what the data and code look like, but I think it would also be good to double-check that a data issue didn't cause this.

My recommendation for now would be to run it with job compute as it's a fixed runtime, and try to debug each task to get a better sense of what specific logic is causing the failure. If there is a strict dependency issue, job compute may be a better option for you.
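One way to test the runtime-update theory is to record the environment at the start of each run and compare a passing run with a failing one. The following is only a sketch using the Python standard library; `environment_snapshot` and `diff_packages` are hypothetical helper names, and where the snapshot gets stored (job logs, a table, a file) is left open.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Return the Python version and installed package versions as a dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages[name.lower()] = dist.version
    return {
        "python": platform.python_version(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


def diff_packages(old, new):
    """Return {name: (old_version, new_version)} for packages that differ."""
    names = set(old) | set(new)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(names)
        if old.get(name) != new.get(name)
    }


if __name__ == "__main__":
    # Print the snapshot so it ends up in the run output for later comparison.
    print(json.dumps(environment_snapshot(), indent=2))
```

Passing the `"packages"` dicts from two snapshots (for example dev vs. prod, or serverless vs. job compute) to `diff_packages` lists only the packages whose versions differ, which narrows down whether a dependency changed underneath the job.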

thibault
Contributor III

Hi @Shua42 , I am switching back to job compute for now in prod.

These are the exact same jobs, reading the same data from UC, just in 2 different workspaces: the one in the dev workspace runs just fine, whereas the one in prod fails. Also, the error seems inconsistent. One job complains about a nonexistent line of code, which looks like a log timestamp, in an empty __init__.py file, and another job fails due to what seems to be a circular import. This all started overnight, while the latest code changes were made weeks ago.
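When a traceback quotes a line that does not exist in the source, it can help to check which file Python actually resolves for the package. This is a diagnostic sketch, not a confirmed explanation of the failure; `describe_module` is a hypothetical helper and the package name passed to it would be your own.

```python
import importlib.util
import os


def describe_module(name):
    """Report where Python would load the top-level module `name` from.

    For a top-level name, find_spec() locates the module without executing it.
    (For dotted names it imports the parent packages first.)
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"name": name, "found": False}
    origin = spec.origin
    on_disk = origin is not None and os.path.isfile(origin)
    return {
        "name": name,
        "found": True,
        "origin": origin,
        # Size of the file behind the module; None for built-in or namespace modules.
        "size_bytes": os.path.getsize(origin) if on_disk else None,
        "is_package": spec.submodule_search_locations is not None,
    }
```

If the package's `__init__.py` is supposed to be empty, a nonzero `size_bytes` or an unexpected `origin` path would suggest the run is loading a different file than the one in the repository.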

I'll file a bug as this seems unrelated to our setup.

thibault
Contributor III

@Shua42 , I was able to reproduce the error running a notebook from the bundle file structure.

The interesting thing is that if I clone the whole content of the folder under .bundle, and run the notebook from that new structure, it no longer fails.

Deleting the bundle and redeploying does not help, and renaming the clone re-triggers the error. Not sure if that helps, but I'll keep testing things out. 
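To confirm that the clone and the deployed bundle folder really hold identical files, so that only the path differs, the two trees can be hashed and compared. This is a sketch that assumes both folders are readable as ordinary files from where the code runs; `tree_digest` and `compare_trees` are hypothetical helpers, and the two root paths are whatever your deployed and cloned folders are.

```python
import hashlib
import os


def tree_digest(root):
    """Map each file's path relative to `root` to the SHA-256 of its bytes."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            rel_path = os.path.relpath(full_path, root)
            with open(full_path, "rb") as handle:
                digests[rel_path] = hashlib.sha256(handle.read()).hexdigest()
    return digests


def compare_trees(left_root, right_root):
    """List files present on one side only and files whose contents differ."""
    left, right = tree_digest(left_root), tree_digest(right_root)
    return {
        "only_left": sorted(set(left) - set(right)),
        "only_right": sorted(set(right) - set(left)),
        "changed": sorted(p for p in set(left) & set(right) if left[p] != right[p]),
    }
```

If all three lists come back empty, the file contents match, which would point toward something keyed on the folder path rather than on the code itself.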

thibault
Contributor III

@Shua42 , strange thing, all serverless tests started passing again today, so I redeployed all bundles as serverless jobs, and it is working again. Does this sound related to a bug Databricks found and fixed this week?

Shua42
Databricks Employee

Hey @thibault ,

Glad to hear it is working again. I don't see any specific mention of a bug internally that would be related to this, but it is likely that it was due to a change in the underlying runtime for serverless compute.

This may be one of the tradeoffs you should consider with serverless vs. standard jobs compute. The lack of a need to manage the environment with serverless does help reduce the maintenance overhead, but could lead to inconsistent dependency issues with your code as you don't have as much control over the environment.
