<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Want to use DataFrame equality functions but also Numpy &gt;= 2.0 in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/want-to-use-dataframe-equality-functions-but-also-numpy-gt-2-0/m-p/139105#M51097</link>
    <description>&lt;P&gt;Thanks for the answer. I am still surprised that this dependency issue has not been fixed yet. For the moment we have custom comparison methods in our project; I just wanted to replace them with the official method to make the package lighter.&lt;/P&gt;&lt;P&gt;Thanks a lot ;)&lt;/P&gt;</description>
    <pubDate>Fri, 14 Nov 2025 14:55:17 GMT</pubDate>
    <dc:creator>Y_WANG</dc:creator>
    <dc:date>2025-11-14T14:55:17Z</dc:date>
    <item>
      <title>Want to use DataFrame equality functions but also Numpy &gt;= 2.0</title>
      <link>https://community.databricks.com/t5/data-engineering/want-to-use-dataframe-equality-functions-but-also-numpy-gt-2-0/m-p/137991#M50842</link>
      <description>&lt;P&gt;My team has many data science workflows that use Spark and pandas. To ensure the stability of these workflows, we need to implement unit tests. Recently, I discovered the&amp;nbsp;&lt;A href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.testing.html" target="_blank" rel="noopener noreferrer"&gt;DataFrame equality test functions&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;introduced in Spark 3.5, which seem easy to use. But when I tried to import assertDataFrameEqual, I got an AttributeError because of NumPy:&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN&gt;from pyspark.testing import assertDataFrameEqual&lt;/SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;File &lt;/SPAN&gt;&lt;SPAN class=""&gt;/opt/spark/python/pyspark/pandas/strings.py:1332&lt;/SPAN&gt;, in &lt;SPAN class=""&gt;StringMethods&lt;/SPAN&gt;&lt;SPAN class=""&gt;()&lt;/SPAN&gt;
&lt;SPAN class=""&gt;   1328&lt;/SPAN&gt;         &lt;SPAN class=""&gt;return&lt;/SPAN&gt; s.str.ljust(width, fillchar)
&lt;SPAN class=""&gt;   1330&lt;/SPAN&gt;     &lt;SPAN class=""&gt;return&lt;/SPAN&gt; &lt;SPAN&gt;self&lt;/SPAN&gt;._data.pandas_on_spark.transform_batch(pandas_ljust)
&lt;SPAN class=""&gt;-&amp;gt; &lt;/SPAN&gt;&lt;SPAN class=""&gt;1332&lt;/SPAN&gt; &lt;SPAN class=""&gt;def&lt;/SPAN&gt; &lt;SPAN class=""&gt;match&lt;/SPAN&gt;(&lt;SPAN&gt;self&lt;/SPAN&gt;, pat: &lt;SPAN&gt;str&lt;/SPAN&gt;, case: &lt;SPAN&gt;bool&lt;/SPAN&gt; = &lt;SPAN class=""&gt;True&lt;/SPAN&gt;, flags: &lt;SPAN&gt;int&lt;/SPAN&gt; = &lt;SPAN class=""&gt;0&lt;/SPAN&gt;, na: Any = &lt;SPAN class=""&gt;np&lt;/SPAN&gt;&lt;SPAN class=""&gt;.&lt;/SPAN&gt;&lt;SPAN class=""&gt;NaN&lt;/SPAN&gt;) -&amp;gt; &lt;SPAN class=""&gt;"&lt;/SPAN&gt;&lt;SPAN class=""&gt;ps.Series&lt;/SPAN&gt;&lt;SPAN class=""&gt;"&lt;/SPAN&gt;:
&lt;SPAN class=""&gt;   1333&lt;/SPAN&gt;     &lt;SPAN class=""&gt;"""&lt;/SPAN&gt;
&lt;SPAN class=""&gt;   1334&lt;/SPAN&gt; &lt;SPAN class=""&gt;    Determine if each string matches a regular expression.&lt;/SPAN&gt;
&lt;SPAN class=""&gt;   1335&lt;/SPAN&gt; 
&lt;SPAN class=""&gt;   (...)&lt;/SPAN&gt;&lt;SPAN class=""&gt;   1390&lt;/SPAN&gt; &lt;SPAN class=""&gt;    dtype: object&lt;/SPAN&gt;
&lt;SPAN class=""&gt;   1391&lt;/SPAN&gt; &lt;SPAN class=""&gt;    """&lt;/SPAN&gt;
&lt;SPAN class=""&gt;   1393&lt;/SPAN&gt;     &lt;SPAN class=""&gt;def&lt;/SPAN&gt; &lt;SPAN class=""&gt;pandas_match&lt;/SPAN&gt;(s) -&amp;gt; ps.Series[&lt;SPAN&gt;bool&lt;/SPAN&gt;]:  &lt;SPAN&gt;# type: ignore[no-untyped-def]&lt;BR /&gt;&lt;STRONG&gt;&lt;SPAN class=""&gt;AttributeError&lt;/SPAN&gt;: `np.NaN` was removed in the NumPy 2.0 release. Use `np.nan` instead&lt;/STRONG&gt;.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;Which is very shameful as Numpy has published the version 2.0 since Juin 2024, and this library is an important dependencies with Pandas &amp;gt; 2.0 out Data Science environnement, we can not downgrade it just for using this test function.&lt;BR /&gt;Is there any solution or best practices for using the&amp;nbsp;&lt;SPAN&gt;asserDataFrameEqual in test unit please ?&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 06 Nov 2025 15:56:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/want-to-use-dataframe-equality-functions-but-also-numpy-gt-2-0/m-p/137991#M50842</guid>
      <dc:creator>Y_WANG</dc:creator>
      <dc:date>2025-11-06T15:56:29Z</dc:date>
    </item>
    <item>
      <title>Re: Want to use DataFrame equality functions but also Numpy &gt;= 2.0</title>
      <link>https://community.databricks.com/t5/data-engineering/want-to-use-dataframe-equality-functions-but-also-numpy-gt-2-0/m-p/138009#M50845</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/196864"&gt;@Y_WANG&lt;/a&gt;&amp;nbsp;The AttributeError you get when importing assertDataFrameEqual from pyspark.testing in Spark 3.5 is caused by Spark code that still uses the np.NaN alias, which was removed in NumPy 2.0 in favor of np.nan. As your traceback shows, the reference sits in pyspark/pandas/strings.py, which is loaded when pyspark.testing is imported.&lt;/P&gt;&lt;P&gt;Alternative Testing Options&lt;BR /&gt;Consider an external library such as chispa for DataFrame equality testing while awaiting a fix:&lt;BR /&gt;&lt;A href="https://github.com/databrickslabs/chispa" target="_blank"&gt;https://github.com/databrickslabs/chispa&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Alternatively, use pandas.testing.assert_frame_equal for pandas or pandas-on-Spark DataFrames, taking care to handle NaN equality explicitly.&lt;/P&gt;</description>
      <pubDate>Thu, 06 Nov 2025 17:41:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/want-to-use-dataframe-equality-functions-but-also-numpy-gt-2-0/m-p/138009#M50845</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-11-06T17:41:29Z</dc:date>
    </item>
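    <!--
    The reply above advises handling NaN equality explicitly when comparing
    DataFrames without pyspark.testing. A minimal sketch of that idea using only
    the standard library, under these assumptions: rows are plain dicts (for
    example produced by Row.asDict() on collected Spark rows), every cell value
    is hashable, and row order does not matter. The helper names rows_equal and
    _normalize are hypothetical and are not part of PySpark, pandas, or chispa.

```python
import math
from collections import Counter


def _normalize(value):
    """Map a cell to a hashable key in which every float NaN is the same token."""
    if isinstance(value, float) and math.isnan(value):
        return ("nan",)
    return ("value", value)


def rows_equal(actual, expected):
    """Compare two collections of row dicts, ignoring row order, with NaN equal to NaN."""

    def as_multiset(rows):
        # Each row becomes a tuple of (column, normalized cell) pairs sorted by
        # column name; the Counter keeps duplicate rows distinct.
        return Counter(
            tuple(sorted((name, _normalize(cell)) for name, cell in row.items()))
            for row in rows
        )

    return as_multiset(actual) == as_multiset(expected)


# Usage: the same rows in a different order, including a NaN cell.
actual = [{"id": 1, "score": float("nan")}, {"id": 2, "score": 0.5}]
expected = [{"id": 2, "score": 0.5}, {"id": 1, "score": float("nan")}]
assert rows_equal(actual, expected)
```

    With Spark, the inputs could be built as [r.asDict() for r in df.collect()],
    which is only practical for the small DataFrames typical of unit tests.
    Plain equality is used for all other values, so 1 and 1.0 compare equal.
    -->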
    <item>
      <title>Re: Want to use DataFrame equality functions but also Numpy &gt;= 2.0</title>
      <link>https://community.databricks.com/t5/data-engineering/want-to-use-dataframe-equality-functions-but-also-numpy-gt-2-0/m-p/139105#M51097</link>
      <description>&lt;P&gt;Thanks for the answer. I am still surprised that this dependency issue has not been fixed yet. For the moment we have custom comparison methods in our project; I just wanted to replace them with the official method to make the package lighter.&lt;/P&gt;&lt;P&gt;Thanks a lot ;)&lt;/P&gt;</description>
      <pubDate>Fri, 14 Nov 2025 14:55:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/want-to-use-dataframe-equality-functions-but-also-numpy-gt-2-0/m-p/139105#M51097</guid>
      <dc:creator>Y_WANG</dc:creator>
      <dc:date>2025-11-14T14:55:17Z</dc:date>
    </item>
  </channel>
</rss>

