Data Engineering

pyspark.testing.assertSchemaEqual() ignoreColumnOrder parameter exists in 3.5.0 only on Databricks

Rainer
New Contributor

Hi, 

I am using the pyspark.testing.assertSchemaEqual() function in my code with the ignoreColumnOrder parameter, which has been available since PySpark 4.0.0.

https://spark.apache.org/docs/4.0.0/api/python/reference/api/pyspark.testing.assertSchemaEqual.html
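
Roughly what I am calling looks like this (a minimal sketch with made-up placeholder schemas, just to show the parameter in question):

```python
# Minimal sketch: compare two schemas that differ only in column order.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.testing import assertSchemaEqual

actual = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
expected = StructType([
    StructField("age", IntegerType(), True),
    StructField("name", StringType(), True),
])

# Passes where ignoreColumnOrder is supported; on a plain pip install of
# pyspark 3.5.x it raises TypeError: unexpected keyword argument.
assertSchemaEqual(actual, expected, ignoreColumnOrder=True)
```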

Locally I am using Databricks Connect. This "kind of" already includes pyspark, but not really: at least it is not the pyspark you install via pip. You can "import pyspark", but it is not installed explicitly. The code runs.
Now I installed a new package (soda-spark-df) which has the "real" pyspark as a dependency; it installs pyspark 3.5.6. Now I am getting an error that ignoreColumnOrder cannot be found, since it does not exist in 3.5.6.
https://spark.apache.org/docs/3.5.6/api/python/reference/api/pyspark.testing.assertSchemaEqual.html
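
For what it is worth, this is how I checked which pyspark my local process actually picks up (nothing Databricks-specific, just the standard module attributes):

```python
# Show which pyspark copy wins the import: the one bundled with
# Databricks Connect or the pip-installed 3.5.6 pulled in by soda-spark-df.
import pyspark

print(pyspark.__version__)  # e.g. 3.5.6 after the new dependency was installed
print(pyspark.__file__)     # the site-packages path reveals which copy was imported
```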

So far, so good. What surprises me is that I can use this parameter on my 15.4 runtime cluster even though it ships pyspark 3.5.0.

My question now is: is the PySpark on Databricks a fork of the open-source PySpark?

2 REPLIES

mark_ott
Databricks Employee

Is Databricks PySpark a Fork?

  • Not a true fork: it is not maintained independently of Apache Spark but is a close superset with proprietary improvements. The codebases largely track each other, but Databricks sometimes "forks" select modules or patches and later merges them back when upstream releases catch up.

  • Result: features may appear early, or behave differently, on Databricks compared with open-source PySpark (a version-guard sketch follows below).
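
If you need the same test code to run against both environments, one defensive option is to check at runtime whether the installed assertSchemaEqual actually accepts the parameter, rather than trusting the reported version number (a sketch, not an official recommendation):

```python
# Guard sketch: inspect the signature of whatever assertSchemaEqual is installed
# instead of assuming the pyspark version number tells the whole story.
import inspect
from pyspark.testing import assertSchemaEqual

def supports_ignore_column_order() -> bool:
    # True where the parameter exists (or was backported into the runtime),
    # False on a plain pip install of pyspark 3.5.x.
    return "ignoreColumnOrder" in inspect.signature(assertSchemaEqual).parameters
```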

saurabh18cs
Honored Contributor II

Hi @Rainer. When you use Databricks Connect, your local code is executed against the Databricks cluster, which uses the Databricks Runtime's PySpark, not your local PySpark installation; in other words, the driver is also running on the remote compute. I believe the Databricks Runtime uses the open-source Apache Spark codebase, but it often includes patches, backports, and enhancements that are not yet released in the official open-source PySpark packages on PyPI. This is why DBR has a different flavour and different optimizations than open-source PySpark, and it is what distinguishes Databricks from other Spark providers, for example Fabric.
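
You can see this split for yourself (a rough sketch, assuming Databricks Connect is already configured via a profile or environment variables):

```python
# Contrast the local client library version with the Spark version reported
# by the remote cluster that actually executes the plan.
import pyspark
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

print("local pyspark client:", pyspark.__version__)  # e.g. 3.5.6 from pip
print("remote cluster Spark:", spark.version)        # Spark version of the Databricks Runtime
```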