Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Python Spark Job - error: job failed with error message The output of the notebook is too large.

lukas_vlk
New Contributor III

Hi Databricks experts. I am currently facing a problem with a submitted job run on Azure Databricks. Any help on this is very welcome. See below for details:

Problem Description:

I submitted a Python Spark task via the Databricks CLI (v0.16.4) to the Azure Databricks REST API (v2.0) to run on a new job cluster. See the attached job.json for the cluster configuration. The job runs successfully and all outputs are generated as expected. Despite that, the job fails with an error message saying that "The output of the notebook is too large".
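
For reference, the payload for such a run has roughly the following shape. This is only a sketch: the attached job.json is not reproduced here, so every value below is a placeholder rather than the actual configuration, and submission would be done with something like "databricks runs submit --json-file job.json" via the legacy CLI.

# Rough sketch of a Jobs API 2.0 runs/submit payload for a spark_python_task
# on a new job cluster. All values are placeholders, not the actual job.json.
job_payload = {
    "run_name": "python-spark-job",                # placeholder
    "new_cluster": {
        "spark_version": "10.4.x-scala2.12",       # placeholder DBR version
        "node_type_id": "Standard_DS3_v2",         # placeholder Azure node type
        "num_workers": 0,                          # driver-only, as described in a reply below
    },
    "spark_python_task": {
        "python_file": "dbfs:/path/to/script.py",  # placeholder path to the Python script
        "parameters": [],
    },
}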

My questions regarding this problem are:

- Why does a job submitted as a spark_python_task display an error message related to notebook tasks?

- Why does the job fail even though the log output does not exceed the limit? (See below for details)

What did I expect to see:

Successful completion of the job with no errors

What did I see:

The job failed with an error message: "Run result unavailable: job failed with error message The output of the notebook is too large."

Steps already taken:

1. Consulted the Azure and Databricks documentation for a possible error cause.

According to the documentation, this error occurs if the stdout logs exceed 20 MB.

Actual stdout log output size: 1.8 MB

2. Increased the py4j log level to reduce the stdout log output:

import logging
logging.getLogger("py4j.java_gateway").setLevel(logging.ERROR)

Reduced stdout log output size: 390 KB

3. Used log4j to write application logs (see the sketch below)
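
A sketch of this approach, assuming an existing SparkSession named spark and a DBR release that still ships log4j 1.x; the logger name "my_app" is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Obtain a log4j logger from the driver JVM; these messages go to the driver's
# log4j output instead of stdout, so they do not add to the stdout log size.
logger = spark._jvm.org.apache.log4j.LogManager.getLogger("my_app")
logger.info("application log message")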

1 ACCEPTED SOLUTION

lukas_vlk
New Contributor III

Without any further changes on my side, the error has disappeared since 29.03.2022.

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

Output is usually something related to print(), collect(), etc.

The documentation you mentioned includes a Spark config to disable stdout output entirely (spark.databricks.driver.disableScalaOutput true). I know that is not what you want to use, but it could help diagnose whether the problem lies with the logs or with the script output.
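
For diagnosis, that setting would go under spark_conf in the job cluster configuration, for example (everything except the config key is a placeholder):

# Fragment of the new_cluster section of the job payload; only spark_conf is
# the point here, the other values remain placeholders.
new_cluster = {
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.driver.disableScalaOutput": "true",
    },
}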

Not many people use spark_python_task; most use notebooks (possibly together with Files in Repos or a wheel), so someone from inside Databricks may need to help.

lukas_vlk
New Contributor III

Thanks for the answer 🙂

After posting the question, I tested using "spark.databricks.driver.disableScalaOutput": "true". Unfortunately, this did not solve the problem.

Regarding "collect()" we are running the job with 0 executers, as we are only using spark to load some parquet datasets that are then processed in python. We are however using "spark.sql.execution.arrow.pyspark.enabled": "true" to improve the performance during the conversion to pandas from the spark DataFrames. Increasing "spark.driver.memory" and "spark.driver.maxResultSize" did not help either.
