
Can we use spark-xml with Delta Live Tables?

JeromeB974
New Contributor II

Hi,

Is there a way to use spark-xml with Delta Live Tables (Azure Databricks)?

I've tried something like this, without any success so far:

CREATE LIVE TABLE df17
USING com.databricks.spark.xml
AS SELECT * FROM cloud_files("/mnt/dev/bronze/xml/s4327994", "xml")

Can we load this library with DLT?

5 REPLIES

Hubert-Dudek
Esteemed Contributor III

@Jerome BASTIDE, custom implementations are more straightforward in Python: you can read whatever format you need and just return a DataFrame.

Auto Loader doesn't support XML, so you need to load the XML the traditional way:

import dlt

@dlt.view
def dlt_dev_bronze():
    # spark-xml must be installed on the pipeline cluster for format("xml") to work
    return spark.read.format("xml").option("rowTag", "tag").load("dbfs:/mnt/dev/bronze/xml/s4327994")
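
A downstream live table can then consume that view with dlt.read. A minimal sketch (the table name and comment here are just illustrative):

import dlt

@dlt.table(comment="Records read from the XML bronze view")
def dlt_dev_silver():
    # reads the view defined above within the same pipeline
    return dlt.read("dlt_dev_bronze")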

JeromeB974
New Contributor II

Hi,

No, I didn't manage to make it work in either SQL or Python.

It seems to require spark-xml, and I didn't find a way to use that library with Delta Live Tables.

I will try Auto Loader in binary format.
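
Roughly what I have in mind is something like this (just a sketch; the element name and parsing logic are placeholders, not tested yet):

import dlt
import xml.etree.ElementTree as ET
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(StringType())
def parse_title(content):
    # placeholder parse: grab the text of a <title> element, if present
    root = ET.fromstring(bytes(content))
    el = root.find("title")
    return el.text if el is not None else None

@dlt.table(comment="XML files ingested as binary and parsed with the standard library")
def xml_binary_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load("/mnt/dev/bronze/xml/s4327994")
        .withColumn("title", parse_title(col("content")))
    )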

Regards.

Zachary_Higgins
Contributor

This is a tough one, since the only magic command available in a DLT notebook is %pip, but spark-xml is a Maven package. The only way I found to do this was to install the spark-xml JAR from the Maven repo using the databricks-cli. You can get the cluster ID with spark.conf.get("spark.databricks.clusterUsageTags.clusterId"), something not well documented in the Databricks CLI documentation. This is not secure or production-ready, but it is a good starting point.

Found this post last week and couldn't find a solution. So here is my submission 🙂

import subprocess
import dlt

@dlt.table(
    name="xmldata",
    comment="Some XML Data")
def dlt_xmldata():
    host = ""
    token = ""
    clusterid = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    path = ""
    rowTag = ""

    # Configure the Databricks CLI on the driver, then install the spark-xml
    # Maven package onto this pipeline's cluster before reading.
    pysh = """
    pip install databricks-cli
    rm -f ~/.databrickscfg
    touch ~/.databrickscfg
    echo "[DEFAULT]" >> ~/.databrickscfg
    echo "host = {1}" >> ~/.databrickscfg
    echo "token = {2}" >> ~/.databrickscfg
    export DATABRICKS_CONFIG_FILE=~/.databrickscfg

    databricks libraries install --cluster-id {0} --maven-coordinates "com.databricks:spark-xml_2.12:0.14.0"
    databricks libraries list --cluster-id {0}
    """

    subprocess.run(pysh.format(clusterid, host, token),
                   shell=True, check=True,
                   executable='/bin/bash')

    return spark.read.format("xml").option("rowTag", rowTag).option("nullValue", "").load(path)

Also need to give credit where credit is due regarding the idea to set up databricks-cli from the notebook: How to fix 'command not found' error in Databricks when creating a secret scope - Stack Overflow

Just following up. My submission is a bad solution and shouldn't be implemented. This broke the moment we used %pip to install additional libraries.

I sent our wishes to the Databricks reps we work with, but at this time there doesn't seem to be a good way to support XML. In our case, we added a workflow task (scheduled job) to load these XML documents into a Delta table, and we use that Delta table as one of the sources in our DLT pipeline.
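
On the DLT side, that workaround amounts to just reading the staging table the scheduled job maintains. A minimal sketch (the table name bronze.xml_staging is made up for illustration):

import dlt

@dlt.table(comment="XML documents pre-loaded into Delta by a scheduled job")
def xml_bronze():
    # bronze.xml_staging is a placeholder for whatever table the scheduled job writes
    return spark.read.table("bronze.xml_staging")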
