Databricks Community

BNV · yesterday

Trying to translate this line of a SQL query that evaluates XML to Databricks SQL.

SELECT

MyColumn.value('(/XMLData/Values/ValueDefinition[@colID="10"]/@Value)[1]', 'VARCHAR(max)') as Color

The XML looks like this:

Databricks SQL doesn't seem to support this function ("AnalysisException: Undefined function: MyColumn.value").

Is there anything I can substitute within the query line that will work?

hari-prasad · yesterday

Hi @BNV,

You can leverage UDF or pandasUDF to register user defined functions to customize to parse the XML data using standard python libraries or even in Scala or Java in Databricks notebooks.

In SQL warehouse, you can create custom SQL UDF, follow this link for more Introducing SQL User-Defined Functions | Databricks Blog.

Regards,
Hari Prasad

BNV · yesterday

Thank you but I'm not very familiar with Pandas. This might be out of my realm of knowledge.

Are you saying Pandas would have this functionality including using SQL and that SQL function or that I would need to create a UDF to parse XML (which sounds quite difficult).

hari-prasad · yesterday

@BNV, you can leverage xpath SQL function which can parse the XML which works in both Notebook and SQL warehouse, follow this Spark SQL doc for more details https://spark.apache.org/docs/3.5.4/api/sql/#xpath

here is a sample example

Regards,
Hari Prasad

BNV · yesterday

This might be a good start but I do get an error ("Invalid XPath") when trying to access the column as the xpath. Is it not possible to use a column as the xpath?

SELECT xpath(MyColumn.value, '/XMLData/Values/ValueDefinition[@colID="10"]/@Value)[1]') as Remarks

hari-prasad · yesterday

you can share a sample or mocked value, how your xml looks?

mean while you can give a try with below query

SELECT xpath_string(MyColumn.value, '/XMLData/Values/ValueDefinition[@colID="10"]/@Value')[1] as Remarks

Regards,
Hari Prasad

BNV · 38m ago

Hi. Thank you for replying. My XML sample is in the original post above if it helps.

I doesn't seem like the "_string" version because it's saying:

AnalysisException: Can't extract value from xpath_string(ExData#251799, /XMLData/Values/ValueDefinition[@colID="10"]/@Value')[1]): need struct type but got string

hari-prasad · 14m ago

@BNV, below SQL code worked for me, I'm able to extract Red for colID=10

select
  xpath(
    '''
    <XMLData>
    <Values>
    <ValueDefinition colID="10" Value="Red"/>
    <ValueDefinition colID="20" Value="Square"/>
    <ValueDefinition colID="3" Value=""/>
    </Values>
    </XMLData>''',
    '//ValueDefinition[@colID="10"]/@Value'
  )[0] as value

Regards,
Hari Prasad

hari-prasad · 6m ago

@BNV , little more complex querying to convert XML into rows use below query. For your case just replace XML string with column containing XML value, spark will handle.

select c_value as colID, v_value as value from(select
  posexplode(
    xpath(
      '''
      <XMLData>
      <Values>
      <ValueDefinition colID="10" Value="Red"/>
      <ValueDefinition colID="20" Value="Square"/>
      <ValueDefinition colID="3" Value=""/>
      </Values>
      </XMLData>''',
      '//ValueDefinition/@colID'
    )
  ) as (c_index, c_value),
  posexplode(
    xpath(
      '''
      <XMLData>
      <Values>
      <ValueDefinition colID="10" Value="Red"/>
      <ValueDefinition colID="20" Value="Square"/>
      <ValueDefinition colID="3" Value=""/>
      </Values>
      </XMLData>''',
      '//ValueDefinition/@Value'
    )
  ) as (v_index, v_value))
  where c_index = v_index

Regards,

Hari Prasad

Regards,
Hari Prasad

Stefan-Koch · yesterday

Since Spark Runtime 14.3 and higher, it is possible to read XML using the Spark Read method.

For example:

df = spark.read.option("rowTag", "books").format("xml").load(xmlPath)
df.printSchema()
df.show(truncate=False)

Have a look at the docu: https://docs.databricks.com/en/query/formats/xml.html

hari-prasad · yesterday

Yes, now they support XML parse directly in databricks 14.3 or higher, else earlier you could have leveraged spark xml library jars to parse it.

You can still leverage xpath in case where one of data column hold XML value in a dataset. As @BNV is looking for some SQL based approach.

Regards,
Hari Prasad

Databricks Community

Translating SQL Value Function For XML To Databricks SQL

Connect with Databricks Users in Your Area

Submit your feedback and win a $50 gift card!

Share Your Feedback in Our Community Survey

Databricks Named a Leader in the 2024 Gartner® Magic Quadrant™ for Cloud Database Management Systems

Announcing the new Meta Llama 3.3 model on Databricks

Milestone: DatabricksTV Reaches 100 Videos!