can we use spark-xml with delta live tables ?
04-07-2022 11:29 PM
Hi,
Is there a way to use spark-xml with Delta Live Tables (Azure Databricks)?
I've tried something like this, without any success so far:
CREATE LIVE TABLE df17
USING com.databricks.spark.xml
AS SELECT * FROM cloud_files("/mnt/dev/bronze/xml/s4327994", "xml")
Can we load this library with DLT?
04-11-2022 01:38 PM
@Jerome BASTIDE , custom implementations are more straightforward in Python: you can read the data however you like, as long as the function returns a DataFrame. Auto Loader doesn't support XML, so you need to load the XML the traditional way:
import dlt

@dlt.view
def dlt_dev_bronze():
    # Requires the spark-xml library to be available on the pipeline cluster
    return spark.read.format("xml").option("rowTag", "tag").load("dbfs:/mnt/dev/bronze/xml/s4327994")
04-28-2022 09:13 AM
Hi,
No, I didn't manage to make it work, neither in SQL nor in Python.
It seems to require spark-xml, and I didn't find a way to use that library with Delta Live Tables.
I will try Auto Loader in binary mode instead, as sketched below.
Regards.
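For reference, a minimal sketch of what that Auto Loader binary ingestion could look like in a DLT Python notebook (the path comes from the earlier posts; the table name and comment are assumptions, not something confirmed in this thread):

import dlt

@dlt.table(name="xml_raw", comment="Raw XML files ingested with Auto Loader in binaryFile mode")
def xml_raw():
    # Auto Loader cannot parse XML, but it can ingest the files as binary;
    # the raw payload lands in the `content` column for downstream parsing.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load("/mnt/dev/bronze/xml/s4327994")
    )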
05-16-2022 10:48 AM
This is a tough one, since the only magic command available in a DLT notebook is %pip, but spark-xml is a Maven package. The only way I found to do this was to install the spark-xml JAR from the Maven repo using the databricks-cli. You can reference the cluster ID using spark.conf.get("spark.databricks.clusterUsageTags.clusterId"), something not well documented in the Databricks CLI documentation. This is not secure/production-ready, but it is a good starting point.
Found this post last week and couldn't find a solution. So here is my submission 🙂
import subprocess
import dlt

@dlt.table(
    name="xmldata",
    comment="Some XML Data")
def dlt_xmldata():
    host = ""       # workspace URL
    token = ""      # personal access token
    clusterid = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    path = ""       # path to the XML files
    rowTag = ""     # XML row tag
    # Configure the databricks-cli on the DLT cluster, then install the spark-xml
    # Maven package on that cluster so format("xml") becomes available.
    pysh = """
    pip install databricks-cli
    rm -f ~/.databrickscfg
    touch ~/.databrickscfg
    echo "[DEFAULT]" >> ~/.databrickscfg
    echo "host = {1}" >> ~/.databrickscfg
    echo "token = {2}" >> ~/.databrickscfg
    export DATABRICKS_CONFIG_FILE=~/.databrickscfg
    databricks libraries install --cluster-id {0} --maven-coordinates "com.databricks:spark-xml_2.12:0.14.0"
    databricks libraries list --cluster-id {0}
    """
    subprocess.run(pysh.format(clusterid, host, token),
                   shell=True, check=True,
                   executable='/bin/bash')
    return spark.read.format("xml").option("rowTag", rowTag).option("nullValue", "").load(path)
05-16-2022 11:57 AM
Also need to give credit where credit is due regarding the idea to set up the databricks-cli from a notebook: How to fix 'command not found' error in Databricks when creating a secret scope - Stack Overflow
06-01-2022 01:42 PM
Just following up. My submission above is a bad solution and shouldn't be implemented; it broke the moment we used %pip to install additional libraries.
I passed our wishes on to the Databricks reps we work with, but at this time there doesn't seem to be a good way to support XML in DLT. In our case, we added a workflow task (scheduled job) that loads these XML documents into a Delta table, and we use that Delta table as one of the sources in our DLT pipeline.
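A rough sketch of that workaround, assuming the path from earlier in the thread; the row tag and staging table name are placeholders I made up, not details from the thread:

# Scheduled job (regular cluster with the spark-xml Maven library installed):
# parse the XML with spark-xml and stage it as a Delta table.
(
    spark.read.format("xml")
    .option("rowTag", "record")              # assumed row tag
    .load("/mnt/dev/bronze/xml/s4327994")
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("bronze.xml_staging")       # assumed staging table name
)

# DLT pipeline: consume the staged Delta table as a normal source.
import dlt

@dlt.table(name="xmldata", comment="XML data staged by the scheduled job")
def xmldata():
    return spark.read.table("bronze.xml_staging")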

