<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: pyspark.testing.assertSchemaEqual() ignoreColumnOrder parameter exists in 3.5.0 only on Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/133043#M49707</link>
    <description>&lt;H2&gt;Is Databricks PySpark a Fork?&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Not a true fork:&lt;/STRONG&gt; It is not maintained independently of Apache Spark; rather, it is a close superset with proprietary improvements. The codebases largely track each other, but Databricks sometimes “forks” select modules or patches, then merges back once upstream releases catch up.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Result:&lt;/STRONG&gt; Features may appear earlier, or behave differently, on Databricks than in open-source PySpark.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Thu, 25 Sep 2025 15:20:56 GMT</pubDate>
    <dc:creator>mark_ott</dc:creator>
    <dc:date>2025-09-25T15:20:56Z</dc:date>
    <item>
      <title>pyspark.testing.assertSchemaEqual() ignoreColumnOrder parameter exists in 3.5.0 only on Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/126320#M47675</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am using the pyspark.testing.assertSchemaEqual() function with the ignoreColumnOrder parameter, which is only available since PySpark 4.0.0.&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/4.0.0/api/python/reference/api/pyspark.testing.assertSchemaEqual.html" target="_blank"&gt;https://spark.apache.org/docs/4.0.0/api/python/reference/api/pyspark.testing.assertSchemaEqual.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Locally I am using Databricks Connect. This "kind of" already includes pyspark, but not really: it is not the pyspark you install via pip. You can "import pyspark" even though the package is not installed explicitly, and the code runs.&lt;BR /&gt;Now I installed a new package (soda-spark-df) which pulls in the "real" pyspark 3.5.6 as a dependency. Now I am getting an error that ignoreColumnOrder cannot be found, since it does not exist in 3.5.6.&lt;BR /&gt;&lt;A href="https://spark.apache.org/docs/3.5.6/api/python/reference/api/pyspark.testing.assertSchemaEqual.html" target="_blank"&gt;https://spark.apache.org/docs/3.5.6/api/python/reference/api/pyspark.testing.assertSchemaEqual.html&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;So far so good. What surprises me is that I can use this parameter on my 15.4 runtime cluster even though PySpark 3.5.0 is installed there.&lt;BR /&gt;&lt;BR /&gt;My question now is: is the PySpark on Databricks a fork of the open-source PySpark?&lt;/P&gt;</description>
      <pubDate>Thu, 24 Jul 2025 09:52:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/126320#M47675</guid>
      <dc:creator>Rainer</dc:creator>
      <dc:date>2025-07-24T09:52:46Z</dc:date>
    </item>
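The version mismatch described above can be guarded against in test code. A minimal compatibility sketch (the helper names are illustrative, not part of any library; it assumes `pyspark.testing.assertSchemaEqual` exists in both 3.5.x and 4.x, with the `ignoreColumnOrder` keyword only accepted from 4.0.0 on):

```python
# Hedged sketch: pass ignoreColumnOrder only where the installed PySpark
# supports it (4.0.0+), and emulate it by sorting fields on 3.5.x.

def supports_ignore_column_order(version: str) -> bool:
    """True if this PySpark version accepts ignoreColumnOrder (>= 4.0)."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= (4, 0)

def assert_schema_equal_compat(actual, expected):
    """Call assertSchemaEqual portably across PySpark 3.5.x and 4.x."""
    import pyspark  # imported lazily so the version helper stays stdlib-only
    from pyspark.testing import assertSchemaEqual

    if supports_ignore_column_order(pyspark.__version__):
        assertSchemaEqual(actual, expected, ignoreColumnOrder=True)
    else:
        # Fallback for 3.5.x: compare schemas with fields sorted by name,
        # which makes column order irrelevant to the comparison.
        from pyspark.sql.types import StructType

        def sort_fields(schema: StructType) -> StructType:
            return StructType(sorted(schema.fields, key=lambda f: f.name))

        assertSchemaEqual(sort_fields(actual), sort_fields(expected))
```

On a 15.4 cluster the `ignoreColumnOrder` branch may still work even though `pyspark.__version__` reports 3.5.0, since the Databricks Runtime build carries backports; the shim simply stays on the portable path there.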
    <item>
      <title>Re: pyspark.testing.assertSchemaEqual() ignoreColumnOrder parameter exists in 3.5.0 only on Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/133043#M49707</link>
      <description>&lt;H2&gt;Is Databricks PySpark a Fork?&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Not a true fork:&lt;/STRONG&gt; It is not maintained independently of Apache Spark; rather, it is a close superset with proprietary improvements. The codebases largely track each other, but Databricks sometimes “forks” select modules or patches, then merges back once upstream releases catch up.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;&lt;STRONG&gt;Result:&lt;/STRONG&gt; Features may appear earlier, or behave differently, on Databricks than in open-source PySpark.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 25 Sep 2025 15:20:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/133043#M49707</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-09-25T15:20:56Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.testing.assertSchemaEqual() ignoreColumnOrder parameter exists in 3.5.0 only on Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/133083#M49720</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176461"&gt;@Rainer&lt;/a&gt; When you use Databricks Connect, your local code is executed against the Databricks cluster, which uses the Databricks Runtime’s PySpark, not your local PySpark installation; your driver is also running on remote compute. I believe Databricks Runtime uses the open-source Apache Spark codebase, but it often includes &lt;STRONG&gt;patches, backports, and enhancements&lt;/STRONG&gt; that are not yet released in the official open-source PySpark packages on PyPI. This is why DBR has a different flavour and set of optimizations than open-source PySpark, and it is what distinguishes Databricks from other Spark providers, for example Fabric.&lt;/P&gt;</description>
      <pubDate>Fri, 26 Sep 2025 14:37:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-testing-assertschemaequal-ignorecolumnorder-parameter/m-p/133083#M49720</guid>
      <dc:creator>saurabh18cs</dc:creator>
      <dc:date>2025-09-26T14:37:07Z</dc:date>
    </item>
  </channel>
</rss>

