cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Translating SQL Value Function For XML To Databricks SQL

BNV
New Contributor

Trying to translate this line of a SQL query that evaluates XML to Databricks SQL.

SELECT
   MyColumn.value('(/XMLData/Values/ValueDefinition[@colID="10"]/@Value)[1]', 'VARCHAR(max)') as Color
 
The XML looks like this:
<XMLData><Values><ValueDefinition colID="10" Value="Red"/><ValueDefinition colID="20" Value="Square"/><ValueDefinition colID="3" Value=""/></Values></XMLData>
 
Databricks SQL doesn't seem to support this function ("AnalysisException: Undefined function: MyColumn.value").
 
Is there anything I can substitute within the query line that will work?
10 REPLIES 10

hari-prasad
Contributor III

Hi @BNV,

You can leverage UDF or pandasUDF to register user defined functions to customize to parse the XML data using standard python libraries or even in Scala or Java in Databricks notebooks.

In SQL warehouse, you can create custom SQL UDF, follow this link for more Introducing SQL User-Defined Functions | Databricks Blog.



Regards,
Hari Prasad

Thank you but I'm not very familiar with Pandas. This might be out of my realm of knowledge.

Are you saying Pandas would have this functionality including using SQL and that SQL function or that I would need to create a UDF to parse XML (which sounds quite difficult).

hari-prasad
Contributor III

@BNV, you can leverage xpath SQL function which can parse the XML which works in both Notebook and SQL warehouse, follow this Spark SQL doc for more details https://spark.apache.org/docs/3.5.4/api/sql/#xpath

here is a sample example

hariprasad_0-1736357597347.png

 



Regards,
Hari Prasad

BNV
New Contributor

This might be a good start but I do get an error ("Invalid XPath") when trying to access the column as the xpath. Is it not possible to use a column as the xpath?

SELECT xpath(MyColumn.value'/XMLData/Values/ValueDefinition[@colID="10"]/@Value)[1]') as Remarks



hari-prasad
Contributor III

you can share a sample or mocked value, how your xml looks?

mean while you can give a try with below query

SELECT xpath_string(MyColumn.value, '/XMLData/Values/ValueDefinition[@colID="10"]/@Value')[1] as Remarks


Regards,
Hari Prasad

Hi. Thank you for replying. My XML sample is in the original post above if it helps.

I doesn't seem like the "_string" version because it's saying:

AnalysisException: Can't extract value from xpath_string(ExData#251799, /XMLData/Values/ValueDefinition[@colID="10"]/@Value')[1]): need struct type but got string

@BNV,  below SQL code worked for me, I'm able to extract Red for colID=10

select
  xpath(
    '''
    <XMLData>
    <Values>
    <ValueDefinition colID="10" Value="Red"/>
    <ValueDefinition colID="20" Value="Square"/>
    <ValueDefinition colID="3" Value=""/>
    </Values>
    </XMLData>''',
    '//ValueDefinition[@colID="10"]/@Value'
  )[0] as value

hariprasad_0-1736442966726.png

 



Regards,
Hari Prasad

@BNV , little more complex querying to convert XML into rows use below query. For your case just replace XML string with column containing XML value, spark will handle.

select c_value as colID, v_value as value from(select
  posexplode(
    xpath(
      '''
      <XMLData>
      <Values>
      <ValueDefinition colID="10" Value="Red"/>
      <ValueDefinition colID="20" Value="Square"/>
      <ValueDefinition colID="3" Value=""/>
      </Values>
      </XMLData>''',
      '//ValueDefinition/@colID'
    )
  ) as (c_index, c_value),
  posexplode(
    xpath(
      '''
      <XMLData>
      <Values>
      <ValueDefinition colID="10" Value="Red"/>
      <ValueDefinition colID="20" Value="Square"/>
      <ValueDefinition colID="3" Value=""/>
      </Values>
      </XMLData>''',
      '//ValueDefinition/@Value'
    )
  ) as (v_index, v_value))
  where c_index = v_index

 

Regards,

Hari Prasad



Regards,
Hari Prasad

Stefan-Koch
Contributor III

Since Spark Runtime 14.3 and higher, it is possible to read XML using the Spark Read method.

For example:

df = spark.read.option("rowTag", "books").format("xml").load(xmlPath)
df.printSchema()
df.show(truncate=False)

 

Have a look at the docu: https://docs.databricks.com/en/query/formats/xml.html

hari-prasad
Contributor III

Yes, now they support XML parse directly in databricks 14.3 or higher, else earlier you could have leveraged spark xml library jars to parse it.

You can still leverage xpath in case where one of data column hold XML value in a dataset. As @BNV is looking for some SQL based approach.



Regards,
Hari Prasad

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group