<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: pyspark.testing.assertSchemaEqual() ignoreColumnOrder parameter exists in 3.5.0 only on Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/133043#M49707</link>
    <description>&lt;H2&gt;Is Databricks PySpark a Fork?&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Not a true fork:&lt;/STRONG&gt; It is not maintained independently of Apache Spark; rather, it is a close superset with proprietary improvements. The codebases largely track each other, but Databricks sometimes “forks” select modules or patches, then merges back once upstream releases catch up.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Result:&lt;/STRONG&gt; Features may appear earlier, or behave differently, on Databricks than in open-source PySpark.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Thu, 25 Sep 2025 15:20:56 GMT</pubDate>
    <dc:creator>mark_ott</dc:creator>
    <dc:date>2025-09-25T15:20:56Z</dc:date>
    <item>
      <title>pyspark.testing.assertSchemaEqual() ignoreColumnOrder parameter exists in 3.5.0 only on Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/126320#M47675</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am using the pyspark.testing.assertSchemaEqual() function with the ignoreColumnOrder parameter, which is only available since PySpark 4.0.0.&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/4.0.0/api/python/reference/api/pyspark.testing.assertSchemaEqual.html" target="_blank"&gt;https://spark.apache.org/docs/4.0.0/api/python/reference/api/pyspark.testing.assertSchemaEqual.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Locally I am using Databricks Connect. This "kind of" already includes pyspark, but not really: it is not the pyspark you install via pip. You can "import pyspark" even though the package is not installed explicitly, and the code runs.&lt;BR /&gt;Now I installed a new package (soda-spark-df) which pulls in the "real" pyspark 3.5.6 as a dependency. Now I am getting an error that ignoreColumnOrder cannot be found, since it does not exist in 3.5.6.&lt;BR /&gt;&lt;A href="https://spark.apache.org/docs/3.5.6/api/python/reference/api/pyspark.testing.assertSchemaEqual.html" target="_blank"&gt;https://spark.apache.org/docs/3.5.6/api/python/reference/api/pyspark.testing.assertSchemaEqual.html&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;So far so good. What surprises me is that I can use this parameter on my 15.4 runtime cluster even though PySpark 3.5.0 is installed there.&lt;BR /&gt;&lt;BR /&gt;My question now is: is the PySpark on Databricks a fork of the open-source PySpark?&lt;/P&gt;</description>
      <pubDate>Thu, 24 Jul 2025 09:52:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/126320#M47675</guid>
      <dc:creator>Rainer</dc:creator>
      <dc:date>2025-07-24T09:52:46Z</dc:date>
    </item>
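The version mismatch described above can be guarded against in test code. A minimal compatibility sketch (the helper names are illustrative, not part of any library; it assumes `pyspark.testing.assertSchemaEqual` exists in both 3.5.x and 4.x, with the `ignoreColumnOrder` keyword only accepted from 4.0.0 on):

```python
# Hedged sketch: pass ignoreColumnOrder only where the installed PySpark
# supports it (4.0.0+), and emulate it by sorting fields on 3.5.x.

def supports_ignore_column_order(version: str) -> bool:
    """True if this PySpark version accepts ignoreColumnOrder (>= 4.0)."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= (4, 0)

def assert_schema_equal_compat(actual, expected):
    """Call assertSchemaEqual portably across PySpark 3.5.x and 4.x."""
    import pyspark  # imported lazily so the version helper stays stdlib-only
    from pyspark.testing import assertSchemaEqual

    if supports_ignore_column_order(pyspark.__version__):
        assertSchemaEqual(actual, expected, ignoreColumnOrder=True)
    else:
        # Fallback for 3.5.x: compare schemas with fields sorted by name,
        # which makes column order irrelevant to the comparison.
        from pyspark.sql.types import StructType

        def sort_fields(schema: StructType) -> StructType:
            return StructType(sorted(schema.fields, key=lambda f: f.name))

        assertSchemaEqual(sort_fields(actual), sort_fields(expected))
```

On a 15.4 cluster the `ignoreColumnOrder` branch may still work even though `pyspark.__version__` reports 3.5.0, since the Databricks Runtime build carries backports; the shim simply stays on the portable path there.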
    <item>
      <title>Re: pyspark.testing.assertSchemaEqual() ignoreColumnOrder parameter exists in 3.5.0 only on Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/133043#M49707</link>
      <description>&lt;H2&gt;Is Databricks PySpark a Fork?&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Not a true fork:&lt;/STRONG&gt; It is not maintained independently of Apache Spark; rather, it is a close superset with proprietary improvements. The codebases largely track each other, but Databricks sometimes “forks” select modules or patches, then merges back once upstream releases catch up.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Result:&lt;/STRONG&gt; Features may appear earlier, or behave differently, on Databricks than in open-source PySpark.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 25 Sep 2025 15:20:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/133043#M49707</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-09-25T15:20:56Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.testing.assertSchemaEqual() ignoreColumnOrder parameter exists in 3.5.0 only on Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/133083#M49720</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176461"&gt;@Rainer&lt;/a&gt; When you use Databricks Connect, your local code is executed against the Databricks cluster, which uses the Databricks Runtime’s PySpark, not your local PySpark installation; your driver is also running on remote compute. I believe Databricks Runtime uses the open-source Apache Spark codebase, but it often includes &lt;STRONG&gt;patches, backports, and enhancements&lt;/STRONG&gt; that are not yet released in the official open-source PySpark packages on PyPI. This is why DBR has a different flavour and set of optimizations than open-source PySpark, and it is what distinguishes Databricks from other Spark providers, for example Fabric.&lt;/P&gt;</description>
      <pubDate>Fri, 26 Sep 2025 14:37:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/133083#M49720</guid>
      <dc:creator>saurabh18cs</dc:creator>
      <dc:date>2025-09-26T14:37:07Z</dc:date>
    </item>
  </channel>
</rss>

