Databricks Community

BNV · ‎01-08-2025

Trying to translate this line of a SQL query that evaluates XML to Databricks SQL.

SELECT

MyColumn.value('(/XMLData/Values/ValueDefinition[@colID="10"]/@Value)[1]', 'VARCHAR(max)') as Color

The XML looks like this:

Databricks SQL doesn't seem to support this function ("AnalysisException: Undefined function: MyColumn.value").

Is there anything I can substitute within the query line that will work?

hari-prasad · ‎01-08-2025

Hi @BNV,

You can leverage UDF or pandasUDF to register user defined functions to customize to parse the XML data using standard python libraries or even in Scala or Java in Databricks notebooks.

In SQL warehouse, you can create custom SQL UDF, follow this link for more Introducing SQL User-Defined Functions | Databricks Blog.

Regards,
Hari Prasad

BNV · ‎01-08-2025

Thank you but I'm not very familiar with Pandas. This might be out of my realm of knowledge.

Are you saying Pandas would have this functionality including using SQL and that SQL function or that I would need to create a UDF to parse XML (which sounds quite difficult).

hari-prasad · ‎01-08-2025

@BNV, you can leverage xpath SQL function which can parse the XML which works in both Notebook and SQL warehouse, follow this Spark SQL doc for more details https://spark.apache.org/docs/3.5.4/api/sql/#xpath

here is a sample example

Regards,
Hari Prasad

BNV · ‎01-08-2025

This might be a good start but I do get an error ("Invalid XPath") when trying to access the column as the xpath. Is it not possible to use a column as the xpath?

SELECT xpath(MyColumn.value, '/XMLData/Values/ValueDefinition[@colID="10"]/@Value)[1]') as Remarks

hari-prasad · ‎01-08-2025

you can share a sample or mocked value, how your xml looks?

mean while you can give a try with below query

SELECT xpath_string(MyColumn.value, '/XMLData/Values/ValueDefinition[@colID="10"]/@Value')[1] as Remarks

Regards,
Hari Prasad

BNV · ‎01-09-2025

Hi. Thank you for replying. My XML sample is in the original post above if it helps.

I doesn't seem like the "_string" version because it's saying:

AnalysisException: Can't extract value from xpath_string(ExData#251799, /XMLData/Values/ValueDefinition[@colID="10"]/@Value')[1]): need struct type but got string

hari-prasad · ‎01-09-2025

@BNV, below SQL code worked for me, I'm able to extract Red for colID=10

select
  xpath(
    '''
    <XMLData>
    <Values>
    <ValueDefinition colID="10" Value="Red"/>
    <ValueDefinition colID="20" Value="Square"/>
    <ValueDefinition colID="3" Value=""/>
    </Values>
    </XMLData>''',
    '//ValueDefinition[@colID="10"]/@Value'
  )[0] as value

Regards,
Hari Prasad

hari-prasad · ‎01-09-2025

@BNV , little more complex querying to convert XML into rows use below query. For your case just replace XML string with column containing XML value, spark will handle.

select c_value as colID, v_value as value from(select
  posexplode(
    xpath(
      '''
      <XMLData>
      <Values>
      <ValueDefinition colID="10" Value="Red"/>
      <ValueDefinition colID="20" Value="Square"/>
      <ValueDefinition colID="3" Value=""/>
      </Values>
      </XMLData>''',
      '//ValueDefinition/@colID'
    )
  ) as (c_index, c_value),
  posexplode(
    xpath(
      '''
      <XMLData>
      <Values>
      <ValueDefinition colID="10" Value="Red"/>
      <ValueDefinition colID="20" Value="Square"/>
      <ValueDefinition colID="3" Value=""/>
      </Values>
      </XMLData>''',
      '//ValueDefinition/@Value'
    )
  ) as (v_index, v_value))
  where c_index = v_index

Regards,

Hari Prasad

Regards,
Hari Prasad

Stefan-Koch · ‎01-08-2025

Since Spark Runtime 14.3 and higher, it is possible to read XML using the Spark Read method.

For example:

df = spark.read.option("rowTag", "books").format("xml").load(xmlPath)
df.printSchema()
df.show(truncate=False)

Have a look at the docu: https://docs.databricks.com/en/query/formats/xml.html

hari-prasad · ‎01-08-2025

Yes, now they support XML parse directly in databricks 14.3 or higher, else earlier you could have leveraged spark xml library jars to parse it.

You can still leverage xpath in case where one of data column hold XML value in a dataset. As @BNV is looking for some SQL based approach.

Regards,
Hari Prasad

Databricks Community

Translating SQL Value Function For XML To Databricks SQL

Photos

Connect with Databricks Users in Your Area

Virtual Learning Festival: 9 April - 30 April

Get Started With Lakehouse Architecture | Pass a quiz to earn your certificate completion.

Data + AI Summit 2025 — registration now open!

Databricks DevConnect: Global Community Meetups for Data Engineers

Databricks Community Champion - February 2025 - Stefan Koch