Can we use spark-xml with Delta Live Tables?

JeromeB974
New Contributor II

Hi,

Is there a way to use spark-xml with Delta Live Tables (Azure Databricks)?

I've tried something like this, without any success so far:

CREATE LIVE TABLE df17
USING com.databricks.spark.xml
AS SELECT * FROM cloud_files("/mnt/dev/bronze/xml/s4327994", "xml")

Can we load this library with DLT?

8 REPLIES

Hubert-Dudek
Esteemed Contributor III

@Jerome BASTIDE, custom implementations are more straightforward in Python. You can read whatever source you need; just return a DataFrame.

Autoloader doesn't support XML, so you need to load XML the traditional way.

import dlt

@dlt.view
def dlt_dev_bronze():
    # spark-xml registers the "xml" data source; in Python, use format("xml") rather than a .xml() reader method
    return spark.read.format("xml").option("rowTag", "tag").load("dbfs:/mnt/dev/bronze/xml/s4327994")
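
A minimal sketch of consuming that view further down the same pipeline (the downstream table name here is hypothetical):

import dlt

@dlt.table(name="dlt_dev_silver", comment="Hypothetical downstream table built from the XML view")
def dlt_dev_silver():
    # dlt.read() references another dataset defined in the same DLT pipeline
    return dlt.read("dlt_dev_bronze")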

Kaniz
Community Manager

Hi @Jerome BASTIDE, just a friendly follow-up. Do you still need help, or did @Hubert Dudek's response help you find a solution? Please let us know.

JeromeB974
New Contributor II

Hi,

No, I didn't manage to make it work in either SQL or Python.

It seems to require spark-xml, and I didn't find a way to use that library with Delta Live Tables.

I will try Autoloader in binary mode instead (a rough sketch of that approach is below).

Regards.
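
A rough sketch of the "Autoloader in binary" idea inside a DLT pipeline, assuming the same source path as above; this only ingests the raw files, so the XML payload would still have to be parsed separately:

import dlt

@dlt.table(name="xml_raw_binary", comment="Hypothetical raw ingest of the XML files via Auto Loader")
def xml_raw_binary():
    # Auto Loader can ingest unsupported formats as binaryFile; each row carries
    # the file path, modification time, length and the raw content bytes
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "binaryFile")
            .load("/mnt/dev/bronze/xml/s4327994"))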

Kaniz
Community Manager

Hi @Jerome BASTIDE, just checking in. Did you try Autoloader in binary mode?

Zachary_Higgins
Contributor

This is a tough one, since the only magic command available in a DLT notebook is %pip, but spark-xml is a Maven package. The only way I found to do this was to install the spark-xml JAR from the Maven repo using the databricks-cli. You can get the cluster ID with spark.conf.get("spark.databricks.clusterUsageTags.clusterId"), something not well documented in the databricks-cli documentation. This is not secure or production-ready, but it is a good starting point.

Found this post last week and couldn't find a solution. So here is my submission 🙂

import dlt
import subprocess

@dlt.table(
  name="xmldata",
  comment="Some XML Data")
def dlt_xmldata():
    host = ""       # workspace URL
    token = ""      # personal access token
    clusterid = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    path = ""       # XML source path
    rowTag = ""     # XML row tag

    # Configure the databricks-cli on the fly and install spark-xml on this cluster
    pysh = """
    pip install databricks-cli
    rm -f ~/.databrickscfg
    touch ~/.databrickscfg
    echo "[DEFAULT]" >> ~/.databrickscfg
    echo "host = {1}" >> ~/.databrickscfg
    echo "token = {2}" >> ~/.databrickscfg
    export DATABRICKS_CONFIG_FILE=~/.databrickscfg

    databricks libraries install --cluster-id {0} --maven-coordinates "com.databricks:spark-xml_2.12:0.14.0"
    databricks libraries list --cluster-id {0}
    """

    subprocess.run(pysh.format(clusterid, host, token),
        shell=True, check=True,
        executable='/bin/bash')

    return spark.read.format("xml").option("rowTag", rowTag).option("nullValue", "").load(path)

Also, credit where credit is due for the idea of setting up the databricks-cli from a notebook: How to fix 'command not found' error in Databricks when creating a secret scope - Stack Overflow

Hi @Zachary Higgins, thank you for posting a fantastic explanation here in the community.

Hi @JeromeB974, just a friendly follow-up. Would you like to tell us whether this reply from @Zachary Higgins helped you? Please let us know.

Just following up. My submission is a bad solution and shouldn't be implemented. This broke the moment we used %pip to install additional libraries.

I passed our wishes on to the Databricks reps we work with, but at this time there doesn't seem to be a good way to support XML. In our case, we added a workflow task (scheduled job) to load these XML documents into a Delta table, and we use that Delta table as one of the sources in our DLT pipeline.
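
A minimal sketch of that workaround, with hypothetical table and path names: a scheduled job outside DLT parses the XML with spark-xml and lands it in a Delta table, and the DLT pipeline then reads that table as a source.

# Scheduled job (outside DLT), with spark-xml installed on the job cluster as a Maven library
(spark.read.format("xml")
    .option("rowTag", "record")            # hypothetical row tag
    .load("/mnt/dev/bronze/xml/s4327994")
    .write.format("delta").mode("overwrite")
    .saveAsTable("bronze.xmldata"))        # hypothetical target table

# DLT pipeline: use the pre-loaded Delta table as a source
import dlt

@dlt.table(name="xmldata_silver", comment="Hypothetical table built from the pre-loaded Delta table")
def xmldata_silver():
    return spark.read.table("bronze.xmldata")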
