Databricks Job issue (run was cancelled by Databricks and Spark UI is not available after 10 minutes)

rpaschenko
New Contributor II

Hi!
We had an issue on 09/19/2023: we launched a job and the run started, but after 10 minutes it was cancelled with no reason given. The Spark UI is not available (which probably means the cluster was never started at all), and I don't see any logs either.
Could you please help?


Kaniz_Fatma
Community Manager

Hi @rpaschenko,

Here are some steps you can take to troubleshoot the issue:

1. **Check the Jobs UI:** Since the Spark UI is unavailable, check the Databricks Jobs UI, which provides a visual overview of completed job runs, filterable by run status and time. The default time filter covers the previous 48 hours. This might give some insight into why the job was cancelled.
  Source: [Databricks Release Notes](https://docs.databricks.com/release-notes/product/2023/june.html)

2. **Collect Necessary Information:** Before starting the investigation, collect the following information:


  - Notebook URL
  - Cluster URL
  - Consent to run commands
  - Time window in which the error occurred
  - Executor logs covering that time window

3. **Check for Common Errors:** Look in the executor logs for common errors such as `java.lang.OutOfMemoryError: Java heap space`.

If such an error is found, the job may have been cancelled because of a Java heap space issue.
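A minimal Python sketch of that log check, assuming you have already downloaded an executor's stderr log as a plain-text file. It flags every line that mentions `java.lang.OutOfMemoryError`, which covers the "Java heap space" variant as well as others such as "GC overhead limit exceeded". The function name `find_oom_lines` is hypothetical, not part of any Databricks tooling.

```python
import re
from typing import Iterable, List, Tuple

# Matches any OutOfMemoryError variant ("Java heap space", "GC overhead limit exceeded", ...)
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError")


def find_oom_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for log lines that mention an OutOfMemoryError."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if OOM_PATTERN.search(line):
            # Strip the trailing newline so the result prints cleanly
            hits.append((lineno, line.rstrip("\n")))
    return hits
```

For example, with a hypothetical local copy named `stderr.txt`, `find_oom_lines(open("stderr.txt", encoding="utf-8"))` returns the matching lines with their line numbers, which makes it easy to line them up against the time window in which the run was cancelled.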

Without the logs or any error messages, however, it is hard to suggest a more specific solution.

If the issue persists, it would be best to contact Databricks support with the information collected above.
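Even when the Spark UI never comes up, the run record kept by the Jobs service may still contain a state message explaining the cancellation. A minimal sketch, assuming the Jobs API 2.1 `runs/get` endpoint and a personal access token; `host`, `token`, and both helper names are placeholders, and the fields read here (`state.life_cycle_state`, `state.result_state`, `state.state_message`, `cluster_instance.cluster_id`) are assumed from the API's documented response shape. For multi-task jobs, cluster details may appear under each entry of `tasks` rather than at the top level.

```python
import json
import urllib.parse
import urllib.request


def fetch_run(host: str, token: str, run_id: int) -> dict:
    """GET /api/2.1/jobs/runs/get for one run (host and token are placeholders)."""
    query = urllib.parse.urlencode({"run_id": run_id})
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/runs/get?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_run(run: dict) -> str:
    """Build a one-line summary from a runs/get response (assumed shape)."""
    state = run.get("state", {})
    life_cycle = state.get("life_cycle_state", "UNKNOWN")
    result = state.get("result_state", "N/A")
    message = state.get("state_message", "") or "(no state message)"
    parts = [
        f"life_cycle_state={life_cycle}",
        f"result_state={result}",
        f"message={message}",
    ]
    cluster_id = run.get("cluster_instance", {}).get("cluster_id")
    if cluster_id is None:
        # A missing cluster_instance may mean no cluster was ever requested for the run
        parts.append("no cluster_instance recorded")
    else:
        parts.append(f"cluster_id={cluster_id}")
    return "; ".join(parts)
```

Usage would look like `print(summarize_run(fetch_run("https://<workspace-host>", token, run_id)))`. The summary is also a convenient thing to paste into a support ticket alongside the other collected information.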

-werners-
Esteemed Contributor III

Was it a one time only error or a recurring one?
For the former, I'd check whether your vCPU quota was exceeded, or whether there was a temporary issue with the cloud provider. It could be a lot of things (there are many moving parts under the hood).

For the latter, we will have to figure out where the problem is located (code, cluster config, job timing, ...) by ruling out as many causes as possible.
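For the one-off causes mentioned above (quota limits or a cloud provider problem), the cluster's event log is where a termination reason would usually show up, provided a cluster was actually requested for the run. A minimal sketch, assuming the Clusters API `POST /api/2.0/clusters/events` endpoint and a response with an `events` list whose entries may carry `details.reason.code` and `details.reason.type`; the host, token, cluster ID, and helper names are placeholders, and the cluster ID would have to come from the run record.

```python
import json
import urllib.request
from typing import List


def fetch_cluster_events(host: str, token: str, cluster_id: str) -> dict:
    """POST /api/2.0/clusters/events for one cluster (host and token are placeholders)."""
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/2.0/clusters/events",
        data=body,
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def termination_reasons(events_response: dict) -> List[str]:
    """List 'EVENT_TYPE: CODE (TYPE)' for every event that carries a termination reason."""
    reasons = []
    for event in events_response.get("events", []):
        reason = event.get("details", {}).get("reason")
        if reason:
            code = reason.get("code", "UNKNOWN")
            reason_type = reason.get("type", "UNKNOWN")
            reasons.append(f"{event.get('type', '?')}: {code} ({reason_type})")
    return reasons
```

If the list comes back empty, or the cluster ID cannot be found at all, that would be consistent with the cluster never having started, which points back toward quota or cloud provider issues rather than the job's code.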
