Can we use spark-xml with Delta Live Tables?

JeromeB974
New Contributor II

Hi,

Is there a way to use spark-xml with Delta Live Tables (Azure Databricks)?

I've tried something like this, without any success so far:

CREATE LIVE TABLE df17
USING com.databricks.spark.xml
AS SELECT * FROM cloud_files("/mnt/dev/bronze/xml/s4327994", "xml")

Can we load this library with DLT?

8 REPLIES

Hubert-Dudek
Esteemed Contributor III

@Jerome BASTIDE, custom implementations are more straightforward in Python. You can read whatever source you need; just return a DataFrame.

Autoloader doesn't support XML, so you need to load XML the traditional way.

import dlt

@dlt.view
def dlt_dev_bronze():
    # spark-xml registers the "xml" data source; in Python, use format("xml") rather than a .xml() reader method
    return spark.read.format("xml").option("rowTag", "tag").load("dbfs:/mnt/dev/bronze/xml/s4327994")
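
A minimal sketch of consuming that view further down the same pipeline (the downstream table name here is hypothetical):

import dlt

@dlt.table(name="dlt_dev_silver", comment="Hypothetical downstream table built from the XML view")
def dlt_dev_silver():
    # dlt.read() references another dataset defined in the same DLT pipeline
    return dlt.read("dlt_dev_bronze")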

Kaniz
Community Manager

Hi @Jerome BASTIDE, just a friendly follow-up. Do you still need help, or did @Hubert Dudek's response help you find a solution? Please let us know.

JeromeB974
New Contributor II

Hi,

No, I didn't manage to make it work in either SQL or Python.

It seems to require spark-xml, and I didn't find a way to use that library with Delta Live Tables.

I will try Autoloader in binary mode instead (a rough sketch of that approach is below).

Regards.
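
A rough sketch of the "Autoloader in binary" idea inside a DLT pipeline, assuming the same source path as above; this only ingests the raw files, so the XML payload would still have to be parsed separately:

import dlt

@dlt.table(name="xml_raw_binary", comment="Hypothetical raw ingest of the XML files via Auto Loader")
def xml_raw_binary():
    # Auto Loader can ingest unsupported formats as binaryFile; each row carries
    # the file path, modification time, length and the raw content bytes
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "binaryFile")
            .load("/mnt/dev/bronze/xml/s4327994"))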

Kaniz
Community Manager

Hi @Jerome BASTIDE, just checking in. Did you try Autoloader in binary mode?

Zachary_Higgins
Contributor

This is a tough one, since the only magic command available in a DLT notebook is %pip, but spark-xml is a Maven package. The only way I found to do this was to install the spark-xml JAR from the Maven repo using the databricks-cli. You can get the cluster ID with spark.conf.get("spark.databricks.clusterUsageTags.clusterId"), something not well documented in the databricks-cli documentation. This is not secure or production-ready, but it is a good starting point.

Found this post last week and couldn't find a solution. So here is my submission 🙂

import dlt
import subprocess

@dlt.table(
  name="xmldata",
  comment="Some XML Data")
def dlt_xmldata():
    host = ""       # workspace URL
    token = ""      # personal access token
    clusterid = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    path = ""       # XML source path
    rowTag = ""     # XML row tag

    # Configure the databricks-cli on the fly and install spark-xml on this cluster
    pysh = """
    pip install databricks-cli
    rm -f ~/.databrickscfg
    touch ~/.databrickscfg
    echo "[DEFAULT]" >> ~/.databrickscfg
    echo "host = {1}" >> ~/.databrickscfg
    echo "token = {2}" >> ~/.databrickscfg
    export DATABRICKS_CONFIG_FILE=~/.databrickscfg

    databricks libraries install --cluster-id {0} --maven-coordinates "com.databricks:spark-xml_2.12:0.14.0"
    databricks libraries list --cluster-id {0}
    """

    subprocess.run(pysh.format(clusterid, host, token),
        shell=True, check=True,
        executable='/bin/bash')

    return spark.read.format("xml").option("rowTag", rowTag).option("nullValue", "").load(path)

Also, credit where credit is due for the idea of setting up the databricks-cli from a notebook: How to fix 'command not found' error in Databricks when creating a secret scope - Stack Overflow

Hi @Zachary Higgins, thank you for posting a fantastic explanation here in the community.

Hi @JeromeB974, just a friendly follow-up. Would you like to tell us whether this reply from @Zachary Higgins helped you? Please let us know.

Just following up. My submission is a bad solution and shouldn't be implemented. This broke the moment we used %pip to install additional libraries.

I passed our wishes on to the Databricks reps we work with, but at this time there doesn't seem to be a good way to support XML. In our case, we added a workflow task (scheduled job) to load these XML documents into a Delta table, and we use that Delta table as one of the sources in our DLT pipeline.
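
A minimal sketch of that workaround, with hypothetical table and path names: a scheduled job outside DLT parses the XML with spark-xml and lands it in a Delta table, and the DLT pipeline then reads that table as a source.

# Scheduled job (outside DLT), with spark-xml installed on the job cluster as a Maven library
(spark.read.format("xml")
    .option("rowTag", "record")            # hypothetical row tag
    .load("/mnt/dev/bronze/xml/s4327994")
    .write.format("delta").mode("overwrite")
    .saveAsTable("bronze.xmldata"))        # hypothetical target table

# DLT pipeline: use the pre-loaded Delta table as a source
import dlt

@dlt.table(name="xmldata_silver", comment="Hypothetical table built from the pre-loaded Delta table")
def xmldata_silver():
    return spark.read.table("bronze.xmldata")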
