Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Handling both PySpark and Python exceptions

smoortema
New Contributor III

In a Python notebook, I am using error handling according to the official documentation:

try:
    [some data transformation steps]
except PySparkException as ex:
    [logging steps to log the error condition and error message in a table]

However, this catches only PySpark exception classes, not all exceptions in the code. To catch the remaining exceptions, I would need to add a Python exception handler:

try:
    [some data transformation steps]
except Exception as ex:
    [logging steps to log the error condition and error message in a table]

My aim is to handle both. If I use only except PySparkException, Python exceptions are not caught; if I use only except Exception, I cannot use the error condition and message for logging. I tried to do both by embedding the try/except PySparkException inside a try/except Exception. This works well for Python exceptions: they are not caught by the PySparkException handler but are then caught by the Exception handler. However, PySpark exceptions are caught twice, which makes logging complicated (the error condition and error message are overwritten by the second exception handler).

Is there a way to define the outer Python Exception handler so that it only catches exceptions that are not already caught by the PySparkException handler?

1 ACCEPTED SOLUTION


mark_ott
Databricks Employee

To handle both PySpark exceptions and general Python exceptions without double-logging or overwriting error details, the recommended approach is to use multiple except clauses on a single try block, ordered by exception type. In Python, except clauses are checked in order, and the first one whose type matches the raised exception handles it. PySparkException is a subclass of Exception, so if except Exception comes first it catches everything; if except PySparkException comes first, only PySpark errors are caught there, and all other exceptions are checked against the next clause. This ordering ensures that a PySparkException is processed exactly once, while all other exceptions are handled separately.

Here’s the idiomatic pattern to solve this issue:

try:
    ...  # some data transformation steps
except PySparkException as ex:
    # log error condition and message using ex.getErrorClass(), ex.getMessageParameters(), ex.getSqlState(), etc.
    ...
except Exception as ex:
    # log that a non-PySpark error occurred, using ex.__class__.__name__ and str(ex)
    ...

This way, a PySparkException never reaches the Exception handler: the first matching except clause handles it and the try statement is done. Only exceptions that do not inherit from PySparkException are handled by the second except clause.
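
As a quick sanity check of the dispatch order, here is a minimal sketch. It assumes a Databricks notebook where spark is already defined; classify is just a throwaway helper for this demonstration, and this_table_does_not_exist is a placeholder name:

from pyspark.errors import PySparkException

def classify(fn):
    # Run fn and report which except clause caught the error (demonstration only)
    try:
        fn()
        return ("no error", None)
    except PySparkException as ex:
        return ("pyspark", ex.getErrorClass())
    except Exception as ex:
        return ("python", ex.__class__.__name__)

# AnalysisException subclasses PySparkException, so the first clause catches it
print(classify(lambda: spark.table("this_table_does_not_exist")))
# A plain Python error does not match the first clause and lands in the second
print(classify(lambda: 1 / 0))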

 

A fuller version of the same pattern, with the import and the error details captured into variables:

from pyspark.errors import PySparkException

try:
    ...  # your data transformation code here
except PySparkException as ex:
    error_condition = ex.getErrorClass()
    msg_params = ex.getMessageParameters()
    sqlstate = ex.getSqlState()
    # log all details
except Exception as ex:
    # log ex.__class__.__name__ and str(ex)
    ...

 

By stacking exception handlers from most specific to most general, both types are captured correctly, without duplicate handling or lost error context.
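
If the goal is to log the error details into a table, as described in the question, both branches can feed the same logging step. Here is a minimal sketch; the log_error helper and the error_log table name are only illustrative, and spark is assumed to be the notebook's predefined SparkSession:

from datetime import datetime
from pyspark.errors import PySparkException

ERROR_LOG_SCHEMA = "logged_at timestamp, error_condition string, error_message string, sqlstate string"

def log_error(error_condition, error_message, sqlstate):
    # Append one row of error details to the illustrative error_log table
    row = [(datetime.now(), error_condition, error_message, sqlstate)]
    spark.createDataFrame(row, ERROR_LOG_SCHEMA).write.mode("append").saveAsTable("error_log")

try:
    ...  # some data transformation steps
except PySparkException as ex:
    log_error(ex.getErrorClass(), str(ex), ex.getSqlState())
except Exception as ex:
    log_error(ex.__class__.__name__, str(ex), None)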

This pattern is explicitly supported and recommended in the official Databricks error handling documentation.



smoortema
New Contributor III

Great, thanks! I was not familiar with stacking exception handlers from most specific to most general, so I started out with two nested try blocks instead of just one.