On my team, we have many data science workflows using Spark and pandas. To ensure the stability of these workflows, we need to implement unit tests. Recently, I discovered the DataFrame equality test functions introduced in Spark 3.5, which seem easy to use. But while trying to import assertDataFrameEqual, I got an AttributeError caused by NumPy:
from pyspark.testing import assertDataFrameEqual
File /opt/spark/python/pyspark/pandas/strings.py:1332, in StringMethods()
1328 return s.str.ljust(width, fillchar)
1330 return self._data.pandas_on_spark.transform_batch(pandas_ljust)
-> 1332 def match(self, pat: str, case: bool = True, flags: int = 0, na: Any = np.NaN) -> "ps.Series":
1333 """
1334 Determine if each string matches a regular expression.
1335
(...) 1390 dtype: object
1391 """
1393 def pandas_match(s) -> ps.Series[bool]: # type: ignore[no-untyped-def]
AttributeError: `np.NaN` was removed in the NumPy 2.0 release. Use `np.nan` instead.
This is unfortunate, as NumPy 2.0 was released in June 2024 and, together with pandas > 2.0, it is an important dependency of our data science environment, so we cannot downgrade it just to use this test function.
Is there a solution or best practice for using assertDataFrameEqual in unit tests?
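For reference, one possible stopgap is to restore the removed alias before pyspark.testing is imported. This is only a sketch: it assumes the `np.NaN` default shown in the traceback is the only NumPy 2.0 incompatibility hit at import time, which has not been verified against the rest of pyspark.pandas. The helper name `restore_nan_alias` is hypothetical.

```python
import numpy as np


def restore_nan_alias() -> None:
    """Re-add the np.NaN alias that NumPy 2.0 removed (hypothetical helper)."""
    # On NumPy 2.x, accessing np.NaN raises AttributeError, so hasattr() is False.
    # On NumPy 1.x the alias still exists and nothing is changed.
    if not hasattr(np, "NaN"):
        np.NaN = np.nan


restore_nan_alias()

# Only after the shim is in place (for example at the top of conftest.py):
# from pyspark.testing import assertDataFrameEqual
```

Because this patches a third-party module at runtime, it would be safer to keep it confined to the test setup rather than production code, and to remove it once the installed Spark version supports NumPy 2.x.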