Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks Runtime, PySpark and Spark Versions

loujiang
New Contributor II

Hello, Dear community,

I was going through the documentation of the function from_xml (pyspark.sql.functions.from_xml - PySpark 4.1.2 documentation), which states that it is available in PySpark from version 4.0.0 onward.

Meanwhile, the Azure/AWS documentation for from_xml (from_xml function - Azure Databricks - Databricks SQL | Microsoft Learn) says it is supported on Databricks Runtime 14.1 and above.

But Databricks Runtime 14.1 uses Apache Spark 3.5.0, which should have no from_xml implementation. How should we understand this difference?

Thanks

best wishes

loujiang

1 ACCEPTED SOLUTION

Accepted Solutions

szymon_dybczak
Esteemed Contributor III

Hi @loujiang ,

Databricks Runtime is not a vanilla Apache Spark distribution. DBR is built on top of a highly optimized version of Apache Spark, but also adds enhancements and additional components that substantially improve usability, performance, and security beyond what's in the open-source release. This means Databricks can - and regularly does - ship Spark features ahead of their upstream release.

Looking directly at the DBR 14.1 release notes (Databricks Runtime 14.1 (EoS) | Databricks on Google Cloud), the Spark changelog section lists:

[SPARK-44788] [SC-142980][CONNECT][PYTHON][SQL] Add from_xml and schema_of_xml to pyspark, spark connect and sql function

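The change tracked by this JIRA ticket was cherry-picked into DBR 14.1, even though DBR 14.1 runs on Spark 3.5.0. In other words, Databricks applied the patch to its own Spark build before it landed in an official Apache Spark release, so the Spark version number alone does not tell you which functions a given runtime provides.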
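Because a backported function can exist on a runtime whose reported Spark version predates it, one practical option is to check for the function itself rather than compare version numbers. Below is a minimal sketch of that idea; the helper name `has_sql_function` is hypothetical (it is not part of PySpark or Databricks), and it only assumes that the function, when present, is exposed as an attribute of `pyspark.sql.functions`.

```python
import importlib


def has_sql_function(name: str, module: str = "pyspark.sql.functions") -> bool:
    """Hypothetical helper: True if `module` imports and exposes `name`.

    Checks the feature directly instead of parsing a version string,
    since a vendor runtime may backport functions to an older Spark base.
    """
    try:
        mod = importlib.import_module(module)
    except ImportError:
        # The module (for example PySpark itself) is not installed here.
        return False
    return hasattr(mod, name)


if has_sql_function("from_xml"):
    print("from_xml is available in this environment")
else:
    print("from_xml is not available in this environment")
```

Run in a notebook on the runtime in question, this reports whether `from_xml` can be imported there, regardless of which upstream Spark release first shipped it.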

 

If my answer was helpful, please consider marking it as the accepted solution.

