yesterday
Trying to translate this line of a SQL query that evaluates XML to Databricks SQL.
yesterday - last edited yesterday
Hi @BNV,
You can leverage UDF or pandasUDF to register user defined functions to customize to parse the XML data using standard python libraries or even in Scala or Java in Databricks notebooks.
In SQL warehouse, you can create custom SQL UDF, follow this link for more Introducing SQL User-Defined Functions | Databricks Blog.
yesterday
Thank you but I'm not very familiar with Pandas. This might be out of my realm of knowledge.
Are you saying Pandas would have this functionality including using SQL and that SQL function or that I would need to create a UDF to parse XML (which sounds quite difficult).
yesterday - last edited yesterday
@BNV, you can leverage xpath SQL function which can parse the XML which works in both Notebook and SQL warehouse, follow this Spark SQL doc for more details https://spark.apache.org/docs/3.5.4/api/sql/#xpath
here is a sample example
yesterday
This might be a good start but I do get an error ("Invalid XPath") when trying to access the column as the xpath. Is it not possible to use a column as the xpath?
yesterday - last edited yesterday
you can share a sample or mocked value, how your xml looks?
mean while you can give a try with below query
11 hours ago
Hi. Thank you for replying. My XML sample is in the original post above if it helps.
I doesn't seem like the "_string" version because it's saying:
AnalysisException: Can't extract value from xpath_string(ExData#251799, /XMLData/Values/ValueDefinition[@colID="10"]/@Value')[1]): need struct type but got string
10 hours ago
@BNV, below SQL code worked for me, I'm able to extract Red for colID=10
select
xpath(
'''
<XMLData>
<Values>
<ValueDefinition colID="10" Value="Red"/>
<ValueDefinition colID="20" Value="Square"/>
<ValueDefinition colID="3" Value=""/>
</Values>
</XMLData>''',
'//ValueDefinition[@colID="10"]/@Value'
)[0] as value
10 hours ago
@BNV , little more complex querying to convert XML into rows use below query. For your case just replace XML string with column containing XML value, spark will handle.
select c_value as colID, v_value as value from(select
posexplode(
xpath(
'''
<XMLData>
<Values>
<ValueDefinition colID="10" Value="Red"/>
<ValueDefinition colID="20" Value="Square"/>
<ValueDefinition colID="3" Value=""/>
</Values>
</XMLData>''',
'//ValueDefinition/@colID'
)
) as (c_index, c_value),
posexplode(
xpath(
'''
<XMLData>
<Values>
<ValueDefinition colID="10" Value="Red"/>
<ValueDefinition colID="20" Value="Square"/>
<ValueDefinition colID="3" Value=""/>
</Values>
</XMLData>''',
'//ValueDefinition/@Value'
)
) as (v_index, v_value))
where c_index = v_index
Regards,
Hari Prasad
yesterday
Since Spark Runtime 14.3 and higher, it is possible to read XML using the Spark Read method.
For example:
df = spark.read.option("rowTag", "books").format("xml").load(xmlPath)
df.printSchema()
df.show(truncate=False)
Have a look at the docu: https://docs.databricks.com/en/query/formats/xml.html
yesterday
Yes, now they support XML parse directly in databricks 14.3 or higher, else earlier you could have leveraged spark xml library jars to parse it.
You can still leverage xpath in case where one of data column hold XML value in a dataset. As @BNV is looking for some SQL based approach.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group