
org.apache.spark.SparkException: Job aborted due to stage failure:

Manmohan_Nayak
New Contributor

Hi

I have around 20 million records in my DataFrame and want to save them to a HORIZONTAL SQL DB.

This is the error:

org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ResultStage 1525 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again.

Here is my code:

df.write.format("jdbc").options( **DB_PROPS, **extra_options, dbtable=table, truncate=truncate).mode(mode).save()

Any idea what could be going wrong?

Regards

 

4 REPLIES

Kaniz
Community Manager

Hi @Manmohan_Nayak, the error message means that a stage of your Spark job failed and was retried, and Spark could not guarantee that the retry would produce a consistent result.

Let’s break it down and explore potential reasons for this error:

 

Indeterminate Output:

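  • The error mentions a “shuffle map stage with indeterminate output.” A stage is indeterminate when recomputing it may produce different rows, or place the same rows in different partitions, than the first attempt did.
  • Possible causes:
    • Repartitioning: A round-robin repartition (for example rdd.repartition(n)) over input whose row order is not guaranteed may assign rows to different partitions on a retry.
    • Non-deterministic Transformations: Expressions such as rand(), or logic that depends on row order, may return different values when recomputed.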

Checkpointing and Repartitioning:

  • The error suggests checkpointing the RDD before repartitioning.
  • Checkpointing writes the data to storage and truncates the RDD lineage, so a retried stage reads the saved data instead of recomputing it.
  • Consider checkpointing your RDD before performing any repartitioning operations.

Spark Configuration:

  • Ensure that your Spark configuration is appropriate for handling large datasets.
  • Consider adjusting parameters like spark.executor.memory, spark.driver.memory, and spark.sql.shuffle.partitions.

Delta Lake (Optional):

  • If you’re using Databricks, consider using Delta Lake for better performance and reliability.
  • Delta Lake optimizes data storage, supports ACID transactions, and provides features like Z-ordering and partition pruning.
  • You can create a Delta table from your DataFrame and write data to it.

Here are some recommendations:

 

Check Data Distribution:

  • Investigate the distribution of data across partitions. Use df.rdd.getNumPartitions() to check the number of partitions.
  • If skewed, consider repartitioning the DataFrame evenly.
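To make the distribution check concrete, here is a minimal PySpark sketch. The helper names partition_row_counts and skew_ratio are hypothetical, not part of any library. It counts rows per partition, which goes one step further than getNumPartitions(). Note that it runs a full job over the DataFrame, so on 20 million rows it is not free.

```python
def partition_row_counts(df):
    """Return a list of (partition_index, row_count) pairs for a DataFrame.

    Uses only RDD.mapPartitionsWithIndex and collect(); this triggers a job.
    """
    def count_rows(index, rows):
        # Count the rows of one partition without materializing them.
        yield index, sum(1 for _ in rows)

    return df.rdd.mapPartitionsWithIndex(count_rows).collect()


def skew_ratio(counts):
    """Largest partition size divided by the mean size (1.0 means even)."""
    sizes = [n for _, n in counts]
    if not sizes or sum(sizes) == 0:
        return 1.0
    return max(sizes) / (sum(sizes) / len(sizes))


# Example usage (assumes an existing DataFrame `df`):
# counts = partition_row_counts(df)
# print(len(counts), "partitions, skew ratio", skew_ratio(counts))
```

A ratio well above 1 suggests a few partitions hold most of the rows. That mainly affects write speed and is a separate concern from the indeterminacy the exception complains about.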

Checkpointing:

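  • Add a checkpoint before writing to the JDBC sink. df.checkpoint() returns a new DataFrame rather than modifying df, so assign the result and write that DataFrame. A checkpoint directory must be set first with spark.sparkContext.setCheckpointDir(...):
    df = df.checkpoint()
    df.write.format("jdbc").options(**DB_PROPS, **extra_options, dbtable=table, truncate=truncate).mode(mode).save()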
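The checkpoint step can be sketched as a small helper. This is an illustration under assumptions, not a guaranteed fix: the function name checkpoint_then_write, the write_fn callback, and the example directory are hypothetical. It relies on the fact that DataFrame.checkpoint() returns a new DataFrame and needs a checkpoint directory on reliable storage.

```python
def checkpoint_then_write(spark, df, checkpoint_dir, write_fn):
    """Checkpoint `df` to reliable storage, then hand the result to `write_fn`.

    `checkpoint_dir` should point at storage that survives node loss.
    `write_fn` is a caller-supplied function that performs the actual write.
    """
    # A checkpoint directory must be set before DataFrame.checkpoint() is used.
    spark.sparkContext.setCheckpointDir(checkpoint_dir)

    # eager=True materializes the data now; the returned DataFrame reads from
    # the checkpoint files instead of recomputing the original lineage.
    stable_df = df.checkpoint(eager=True)

    write_fn(stable_df)
    return stable_df


# Example usage with the write from the question (the path is a placeholder):
# checkpoint_then_write(
#     spark, df, "dbfs:/tmp/checkpoints",
#     lambda d: d.write.format("jdbc")
#                .options(**DB_PROPS, **extra_options, dbtable=table, truncate=truncate)
#                .mode(mode).save(),
# )
```

df.localCheckpoint() is a lighter alternative, but it keeps the data on the executors, so it would not help if the retries are caused by lost nodes.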

Spark Configuration:

  • Adjust Spark configuration based on your cluster resources and workload.
  • Increase spark.sql.shuffle.partitions if needed.
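As a rough sketch of the shuffle-partition adjustment: the helper names below are hypothetical, and the 128 MiB target per partition is only a rule-of-thumb assumption, not a value from this thread. spark.sql.shuffle.partitions can be changed at runtime through spark.conf.set, whereas spark.executor.memory and spark.driver.memory have to be set in the cluster configuration before the application starts.

```python
import math


def suggest_shuffle_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Estimate a partition count so each partition holds about the target size.

    The 128 MiB default is an assumed rule of thumb; tune it for your workload.
    """
    return max(1, math.ceil(total_bytes / target_partition_bytes))


def apply_shuffle_partitions(spark, num_partitions):
    """Set the number of shuffle partitions for subsequent Spark SQL shuffles."""
    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))


# Example usage (assumes an existing SparkSession `spark` and a size estimate):
# apply_shuffle_partitions(spark, suggest_shuffle_partitions(20 * 1024**3))
```

Tuning this setting affects performance and memory pressure. On its own it is unlikely to remove the indeterminacy behind the exception.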

Delta Lake (Optional):

  • Create a Delta table: df.write.format("delta").mode("overwrite").saveAsTable("my_delta_table")
  • Query the Delta table using Databricks SQL endpoints.
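One way to combine the Delta suggestion with the original JDBC write is to stage the data in a Delta table and then write to the database from the staged copy, so the JDBC write reads stored data rather than recomputing the original DataFrame. This is a sketch under assumptions: stage_in_delta_then_write, write_fn, and the table name are hypothetical, and it presumes Delta Lake is available on the cluster.

```python
def stage_in_delta_then_write(spark, df, staging_table, write_fn):
    """Persist `df` as a Delta table, then write the staged copy elsewhere.

    `staging_table` is a table name chosen by the caller; it is overwritten.
    `write_fn` is a caller-supplied function that performs the final write.
    """
    # Materialize the DataFrame once as a Delta table.
    df.write.format("delta").mode("overwrite").saveAsTable(staging_table)

    # Read it back; this DataFrame's lineage starts at the stored table.
    staged_df = spark.table(staging_table)

    write_fn(staged_df)
    return staged_df


# Example usage with the write from the question:
# stage_in_delta_then_write(
#     spark, df, "my_delta_table",
#     lambda d: d.write.format("jdbc")
#                .options(**DB_PROPS, **extra_options, dbtable=table, truncate=truncate)
#                .mode(mode).save(),
# )
```

Compared with a checkpoint, the staged table stays queryable afterwards, at the cost of an extra table to clean up.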

aniketg
New Contributor II

@Manmohan_Nayak Did the resolution work for you?
I have been facing the same error for the last couple of days on a job that was working earlier.

Dusan
New Contributor II

Facing the same issue since we moved from Spark 3.2.1 (Databricks 10.4) to Spark 3.3.2 (Databricks 12.2). How come we never saw this problem before, but now we do? Is it Spark related or Databricks related (autoscaling?)

VZLA
New Contributor II

This exception is raised when a failure leads to a stage retry, but retrying the stage could produce an inconsistent result (indeterminacy). The validation that raises it is performed in newer versions and is likely unavailable in DBR 10.4 and older.

To address the problem, you can, as the error message suggests, checkpoint the DataFrame before the indeterminacy is introduced.

This is commonly seen when nodes are lost, for example due to spot instance termination or similar events. I am not fully sure whether a scale-down event can trigger it, but that could be another reason.
