can we use spark-xml with delta live tables ?
04-07-2022 11:29 PM
Hi,
Is there a way to use spark-xml with Delta Live Tables (Azure Databricks)?
I've tried something like this, without any success so far:
CREATE LIVE TABLE df17
USING com.databricks.spark.xml
AS SELECT * FROM cloud_files("/mnt/dev/bronze/xml/s4327994", "xml")
Can we load this library with DLT?
04-11-2022 01:38 PM
@Jerome BASTIDE , custom implementations are more straightforward in Python: you can read the data however you like, as long as the function returns a DataFrame. Auto Loader doesn't support XML, so you need to load the XML the traditional way:
import dlt

@dlt.view
def dlt_dev_bronze():
    # Requires the spark-xml library to be available on the pipeline cluster
    return spark.read.format("xml").option("rowTag", "tag").load("dbfs:/mnt/dev/bronze/xml/s4327994")
04-28-2022 09:13 AM
Hi,
No, I didn't manage to make it work, neither in SQL nor in Python.
It seems to require spark-xml, and I didn't find a way to use that library with Delta Live Tables.
I will try Auto Loader in binary mode instead, as sketched below.
Regards.
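For reference, a minimal sketch of what that Auto Loader binary ingestion could look like in a DLT Python notebook (the path comes from the earlier posts; the table name and comment are assumptions, not something confirmed in this thread):

import dlt

@dlt.table(name="xml_raw", comment="Raw XML files ingested with Auto Loader in binaryFile mode")
def xml_raw():
    # Auto Loader cannot parse XML, but it can ingest the files as binary;
    # the raw payload lands in the `content` column for downstream parsing.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load("/mnt/dev/bronze/xml/s4327994")
    )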
05-16-2022 10:48 AM
This is a tough one, since the only magic command available in a DLT notebook is %pip, but spark-xml is a Maven package. The only way I found to do this was to install the spark-xml JAR from the Maven repo using the databricks-cli. You can reference the cluster ID using spark.conf.get("spark.databricks.clusterUsageTags.clusterId"), something not well documented in the Databricks CLI documentation. This is not secure/production-ready, but it is a good starting point.
Found this post last week and couldn't find a solution. So here is my submission 🙂
import subprocess
import dlt

@dlt.table(
    name="xmldata",
    comment="Some XML Data")
def dlt_xmldata():
    host = ""       # workspace URL
    token = ""      # personal access token
    clusterid = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    path = ""       # path to the XML files
    rowTag = ""     # XML row tag
    # Configure the databricks-cli on the DLT cluster, then install the spark-xml
    # Maven package on that cluster so format("xml") becomes available.
    pysh = """
    pip install databricks-cli
    rm -f ~/.databrickscfg
    touch ~/.databrickscfg
    echo "[DEFAULT]" >> ~/.databrickscfg
    echo "host = {1}" >> ~/.databrickscfg
    echo "token = {2}" >> ~/.databrickscfg
    export DATABRICKS_CONFIG_FILE=~/.databrickscfg
    databricks libraries install --cluster-id {0} --maven-coordinates "com.databricks:spark-xml_2.12:0.14.0"
    databricks libraries list --cluster-id {0}
    """
    subprocess.run(pysh.format(clusterid, host, token),
                   shell=True, check=True,
                   executable='/bin/bash')
    return spark.read.format("xml").option("rowTag", rowTag).option("nullValue", "").load(path)
05-16-2022 11:57 AM
Also need to give credit where credit is due regarding the idea to set up the databricks-cli from a notebook: How to fix 'command not found' error in Databricks when creating a secret scope - Stack Overflow
06-01-2022 01:42 PM
Just following up. My submission above is a bad solution and shouldn't be implemented; it broke the moment we used %pip to install additional libraries.
I passed our wishes on to the Databricks reps we work with, but at this time there doesn't seem to be a good way to support XML in DLT. In our case, we added a workflow task (scheduled job) that loads these XML documents into a Delta table, and we use that Delta table as one of the sources in our DLT pipeline.
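A rough sketch of that workaround, assuming the path from earlier in the thread; the row tag and staging table name are placeholders I made up, not details from the thread:

# Scheduled job (regular cluster with the spark-xml Maven library installed):
# parse the XML with spark-xml and stage it as a Delta table.
(
    spark.read.format("xml")
    .option("rowTag", "record")              # assumed row tag
    .load("/mnt/dev/bronze/xml/s4327994")
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("bronze.xml_staging")       # assumed staging table name
)

# DLT pipeline: consume the staged Delta table as a normal source.
import dlt

@dlt.table(name="xmldata", comment="XML data staged by the scheduled job")
def xmldata():
    return spark.read.table("bronze.xml_staging")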

