Data Engineering

pyspark.testing.assertSchemaEqual() ignoreColumnOrder parameter exists in 3.5.0 only on Databricks

Rainer
New Contributor

Hi, 

I am using the pyspark.testing.assertSchemaEqual() function in my code with the ignoreColumnOrder parameter, which has been available since PySpark 4.0.0.

https://spark.apache.org/docs/4.0.0/api/python/reference/api/pyspark.testing.assertSchemaEqual.html
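
Roughly what I am calling looks like this (a minimal sketch with made-up placeholder schemas, just to show the parameter in question):

```python
# Minimal sketch: compare two schemas that differ only in column order.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.testing import assertSchemaEqual

actual = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
expected = StructType([
    StructField("age", IntegerType(), True),
    StructField("name", StringType(), True),
])

# Passes where ignoreColumnOrder is supported; on a plain pip install of
# pyspark 3.5.x it raises TypeError: unexpected keyword argument.
assertSchemaEqual(actual, expected, ignoreColumnOrder=True)
```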

Locally I am using Databricks Connect. This "kind of" already includes pyspark, but not really: at least it is not the pyspark you install via pip. You can "import pyspark", but it is not installed explicitly. The code runs.
Now I installed a new package (soda-spark-df) which has the "real" pyspark as a dependency; it installs pyspark 3.5.6. Now I am getting an error that ignoreColumnOrder cannot be found, since it does not exist in 3.5.6.
https://spark.apache.org/docs/3.5.6/api/python/reference/api/pyspark.testing.assertSchemaEqual.html
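
For what it is worth, this is how I checked which pyspark my local process actually picks up (nothing Databricks-specific, just the standard module attributes):

```python
# Show which pyspark copy wins the import: the one bundled with
# Databricks Connect or the pip-installed 3.5.6 pulled in by soda-spark-df.
import pyspark

print(pyspark.__version__)  # e.g. 3.5.6 after the new dependency was installed
print(pyspark.__file__)     # the site-packages path reveals which copy was imported
```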

So far, so good. What surprises me is that I can use this parameter on my 15.4 runtime cluster even though it ships pyspark 3.5.0.

My question now is: is the PySpark on Databricks a fork of the open-source PySpark?

2 REPLIES

mark_ott
Databricks Employee

Is Databricks PySpark a Fork?

  • Not a true fork: it is not maintained independently of Apache Spark but is a close superset with proprietary improvements. The codebases largely track each other, but Databricks sometimes "forks" select modules or patches and later merges them back when upstream releases catch up.

  • Result: features may appear early, or behave differently, on Databricks compared with open-source PySpark (a version-guard sketch follows below).
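
If you need the same test code to run against both environments, one defensive option is to check at runtime whether the installed assertSchemaEqual actually accepts the parameter, rather than trusting the reported version number (a sketch, not an official recommendation):

```python
# Guard sketch: inspect the signature of whatever assertSchemaEqual is installed
# instead of assuming the pyspark version number tells the whole story.
import inspect
from pyspark.testing import assertSchemaEqual

def supports_ignore_column_order() -> bool:
    # True where the parameter exists (or was backported into the runtime),
    # False on a plain pip install of pyspark 3.5.x.
    return "ignoreColumnOrder" in inspect.signature(assertSchemaEqual).parameters
```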

saurabh18cs
Honored Contributor II

Hi @Rainer. When you use Databricks Connect, your local code is executed against the Databricks cluster, which uses the Databricks Runtime's PySpark, not your local PySpark installation; in other words, the driver is also running on the remote compute. I believe the Databricks Runtime uses the open-source Apache Spark codebase, but it often includes patches, backports, and enhancements that are not yet released in the official open-source PySpark packages on PyPI. This is why DBR has a different flavour and different optimizations than open-source PySpark, and it is what distinguishes Databricks from other Spark providers, for example Fabric.
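
You can see this split for yourself (a rough sketch, assuming Databricks Connect is already configured via a profile or environment variables):

```python
# Contrast the local client library version with the Spark version reported
# by the remote cluster that actually executes the plan.
import pyspark
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

print("local pyspark client:", pyspark.__version__)  # e.g. 3.5.6 from pip
print("remote cluster Spark:", spark.version)        # Spark version of the Databricks Runtime
```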