Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Want to use DataFrame equality functions but also Numpy >= 2.0

Y_WANG
New Contributor II

On my team, we have many data science workflows that use Spark and Pandas. To keep these workflows stable, we need to add unit tests. I recently discovered the DataFrame equality test functions introduced in Spark 3.5, which seem easy to use. But when I try to import assertDataFrameEqual, I get an AttributeError caused by NumPy:

from pyspark.testing import assertDataFrameEqual
File /opt/spark/python/pyspark/pandas/strings.py:1332, in StringMethods()
   1328         return s.str.ljust(width, fillchar)
   1330     return self._data.pandas_on_spark.transform_batch(pandas_ljust)
-> 1332 def match(self, pat: str, case: bool = True, flags: int = 0, na: Any = np.NaN) -> "ps.Series":
   1333     """
   1334     Determine if each string matches a regular expression.
   1335 
   (...)   1390     dtype: object
   1391     """
   1393     def pandas_match(s) -> ps.Series[bool]:  # type: ignore[no-untyped-def]
AttributeError: `np.NaN` was removed in the NumPy 2.0 release. Use `np.nan` instead.

This is a real shame: NumPy 2.0 was released in June 2024, and together with Pandas > 2.0 it is an important dependency of our data science environment, so we cannot downgrade it just to use this test function.
Is there a solution or best practice for using assertDataFrameEqual in unit tests?

1 ACCEPTED SOLUTION

Accepted Solutions

ManojkMohan
Honored Contributor II

@Y_WANG The AttributeError you hit when importing assertDataFrameEqual from pyspark.testing in Spark 3.5 occurs because Spark's code still references np.NaN, an alias that was removed in NumPy 2.0 in favor of np.nan. As your traceback shows, the reference is in pyspark/pandas/strings.py (the default value of the na parameter of match), which is loaded as part of that import.

Alternative Testing Options
Consider external libraries like chispa for DataFrame equality testing while awaiting fixes:
https://github.com/databrickslabs/chispa

Use pandas.testing.assert_frame_equal carefully for pandas or pandas-on-Spark DataFrames, especially handling NaN equality explicitly.
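
To illustrate why NaN needs explicit handling: NaN never compares equal to itself, so a naive row-by-row comparison reports a mismatch even when both DataFrames hold NaN in the same cell. Below is a minimal sketch in plain Python, not the chispa or pandas.testing API. It assumes the rows have already been pulled to the driver as tuples, for example with `[tuple(r) for r in df.collect()]`, and that every value is hashable. The names `rows_equal_ignoring_order` and `_normalize` are hypothetical helpers made up for this example.

```python
import math
from collections import Counter

# Sentinel standing in for NaN so that two NaN cells compare (and hash) equal.
# A unique object cannot collide with real data values.
_NAN = object()


def _normalize(value):
    """Replace float NaN with the sentinel; leave every other value as-is."""
    if isinstance(value, float) and math.isnan(value):
        return _NAN
    return value


def rows_equal_ignoring_order(actual, expected):
    """Return True if both row collections hold the same rows,
    regardless of order, treating NaN as equal to NaN.

    Rows are compared as multisets, so duplicate rows must appear
    the same number of times on both sides.
    """
    def to_counter(rows):
        return Counter(tuple(_normalize(v) for v in row) for row in rows)

    return to_counter(actual) == to_counter(expected)


# Stand-ins for rows collected from two Spark DataFrames (assumed data).
actual = [(2, float("nan")), (1, 0.5)]
expected = [(1, 0.5), (2, float("nan"))]

assert rows_equal_ignoring_order(actual, expected)
```

This sketch uses exact equality for non-NaN values. If your columns hold computed floats, you would likely want a tolerance as well, which a library such as chispa or pandas.testing handles for you.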


2 REPLIES

Y_WANG
New Contributor II

Thanks for the answer. I am still surprised that this dependency issue has not been fixed yet. For now we have custom comparison methods in our project; I just wanted to replace them with the official method to make our package lighter.

Thanks a lot ;)
