topic Re: Want to use DataFrame equality functions but also Numpy >= 2.0 in Data Engineering

Want to use DataFrame equality functions but also Numpy >= 2.0

Y_WANG — Thu, 06 Nov 2025 15:56:29 GMT

My team has many data science workflows using Spark and Pandas. To ensure the stability of these workflows, we need to implement unit tests. Recently I discovered the DataFrame equality test functions introduced in Spark 3.5, which seem easy to use. But when trying to import assertDataFrameEqual, I got an AttributeError caused by NumPy:

from pyspark.testing import assertDataFrameEqual

File /opt/spark/python/pyspark/pandas/strings.py:1332, in StringMethods()
   1328         return s.str.ljust(width, fillchar)
   1330     return self._data.pandas_on_spark.transform_batch(pandas_ljust)
-> 1332 def match(self, pat: str, case: bool = True, flags: int = 0, na: Any = np.NaN) -> "ps.Series":
   1333     """
   1334     Determine if each string matches a regular expression.
   1335 
   (...)   1390     dtype: object
   1391     """
   1393     def pandas_match(s) -> ps.Series[bool]:  # type: ignore[no-untyped-def]
AttributeError: `np.NaN` was removed in the NumPy 2.0 release. Use `np.nan` instead.

This is a real shame: NumPy 2.0 was released in June 2024, and it is an important dependency, together with Pandas > 2.0, in our data science environment. We cannot downgrade it just to use this test function.
Is there a solution or best practice for using assertDataFrameEqual in unit tests?

Re: Want to use DataFrame equality functions but also Numpy >= 2.0

ManojkMohan — Thu, 06 Nov 2025 17:41:29 GMT

@Y_WANG The AttributeError you get when importing assertDataFrameEqual from pyspark.testing in Spark 3.5 is caused by Spark's code using the np.NaN alias, which was removed in NumPy 2.0 (np.nan is the replacement). As your traceback shows, the reference sits in pyspark/pandas/strings.py, which is loaded as part of that import, so the import fails before the test function can be used.
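A possible stopgap, not an official fix: restore the removed alias on the numpy module before anything imports pyspark.testing. This is only a sketch. It assumes np.NaN is the only removed NumPy name hit during the import, which I have not verified for every Spark 3.5.x release, and placing it in a conftest.py is just one suggestion.

```python
import numpy as np

# Workaround sketch: NumPy 2.0 removed the np.NaN alias, so put it back
# for code that still references it. Must run BEFORE pyspark.testing is imported.
if not hasattr(np, "NaN"):
    np.NaN = np.nan  # assumption: this is the only removed alias hit on import

# Only after the shim (left commented so the snippet runs without Spark):
# from pyspark.testing import assertDataFrameEqual
```

Since this patches a third-party module at runtime, it is best kept in test setup code only and removed once you move to a Spark version that no longer references np.NaN.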

Alternative Testing Options
Consider external libraries like chispa for DataFrame equality testing while awaiting fixes:
https://github.com/databrickslabs/chispa

Alternatively, use pandas.testing.assert_frame_equal for pandas or pandas-on-Spark DataFrames, but do so carefully: pay particular attention to how NaN values, row order, and column order are compared.
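For the pandas route, here is a minimal sketch of a helper that compares two pandas DataFrames while ignoring row and column order. The helper name assert_frames_match and the sample columns are made up for illustration. It assumes the data has already been brought into pandas and that the sort key contains no NaN.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def assert_frames_match(actual, expected, sort_by):
    """Hypothetical helper: compare pandas DataFrames ignoring row/column order."""
    # Sort rows on a key column and drop the old index so row order is irrelevant.
    a = actual.sort_values(sort_by).reset_index(drop=True)
    e = expected.sort_values(sort_by).reset_index(drop=True)
    # check_like=True ignores column order; NaN values in the same
    # position are treated as equal by assert_frame_equal.
    assert_frame_equal(a, e, check_like=True)


# Same data, different row and column order, with a NaN in one cell.
actual = pd.DataFrame({"id": [2, 1], "score": [np.nan, 0.5]})
expected = pd.DataFrame({"score": [0.5, np.nan], "id": [1, 2]})
assert_frames_match(actual, expected, sort_by="id")  # passes silently
```

A mismatch raises AssertionError with a description of the differing column, so the helper can be called directly inside a pytest or unittest test.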

Re: Want to use DataFrame equality functions but also Numpy >= 2.0

Y_WANG — Fri, 14 Nov 2025 14:55:17 GMT

Thanks for the answer. I am still surprised that this dependency issue has not been fixed yet. For the moment we have custom comparison methods in our project; I just wanted to replace them with the official method to make our package lighter.

Thanks a lot ;)