<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: can we use spark-xml with delta live tables ? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23327#M16078</link>
    <description>&lt;P&gt;This is a tough one since the only magic command available is %pip, but spark-xml is a maven package. The only way I found to do this was to install the spark-xml jar from the maven repo using the databricks-cli. You can reference the cluster ID using spark.conf.get("spark.databricks.clusterUsageTags.clusterId"), something not well documented in the databricks cli documentation. This is not secure/production ready, but is a good starting point.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Found this post last week and couldn't find a solution. So here is my submission &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;@dlt.table(
  name="xmldata",
  comment="Some XML Data")
def dlt_xmldata():    
&amp;nbsp;
    host = ""
    token = ""
    clusterid = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    path = ""
    rowTag=""
&amp;nbsp;
    import subprocess
&amp;nbsp;
    pysh = """
    pip install databricks-cli
    rm ~/.databrickscfg
    touch ~/.databrickscfg
    echo "[DEFAULT]" &amp;gt;&amp;gt; ~/.databrickscfg
    echo "host = {1}" &amp;gt;&amp;gt; ~/.databrickscfg
    echo "token = {2}" &amp;gt;&amp;gt; ~/.databrickscfg
    export DATABRICKS_CONFIG_FILE=~/.databrickscfg
&amp;nbsp;
    databricks libraries install --cluster-id {0} --maven-coordinates "com.databricks:spark-xml_2.12:0.14.0"
    databricks libraries list --cluster-id {0}
    """
    
    subprocess.run(pysh.format(clusterid,host,token),
        shell=True, check=True,
        executable='/bin/bash')
&amp;nbsp;
    return spark.read.format("xml").option("rowTag",rowTag).option("nullValue","").load(path)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 16 May 2022 17:48:18 GMT</pubDate>
    <dc:creator>Zachary_Higgins</dc:creator>
    <dc:date>2022-05-16T17:48:18Z</dc:date>
    <item>
      <title>can we use spark-xml with delta live tables ?</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23322#M16073</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;&lt;P&gt;Is there a way to use spark-xml with delta live tables (Azure Databricks)?&lt;/P&gt;&lt;P&gt;I've tried something like this without any success for the moment&lt;/P&gt;&lt;P&gt;CREATE LIVE TABLE df17&amp;nbsp;&lt;/P&gt;&lt;P&gt;USING com.databricks.spark.xml&lt;/P&gt;&lt;P&gt;AS SELECT * FROM cloud_files("/mnt/dev/bronze/xml/s4327994", "xml")&lt;/P&gt;&lt;P&gt;Can we load this library with dlt?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 08 Apr 2022 06:29:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23322#M16073</guid>
      <dc:creator>JeromeB974</dc:creator>
      <dc:date>2022-04-08T06:29:31Z</dc:date>
    </item>
    <item>
      <title>Re: can we use spark-xml with delta live tables ?</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23323#M16074</link>
      <description>&lt;P&gt;@Jerome BASTIDE​&amp;nbsp;,&amp;nbsp;Custom implementations are more straightforward in python. You can read whatever. Just return DataFrame.&lt;/P&gt;&lt;P&gt;Autoloader doesn't support XML, so you need to load XML the traditional way.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;@dlt.view
def dlt_dev_bronze():
  return spark.read.option("rowTag", "tag").xml("dbfs:/mnt/dev/bronze/xml/s4327994")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 11 Apr 2022 20:38:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23323#M16074</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-04-11T20:38:40Z</dc:date>
    </item>
    <item>
      <title>Re: can we use spark-xml with delta live tables ?</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23325#M16076</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;No, I didn't succeed in making it work in either SQL or Python.&lt;/P&gt;&lt;P&gt;It seems to require spark-xml, and I didn't find a way to use it with delta live tables.&lt;/P&gt;&lt;P&gt;I will try autoloader in binary.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Regards.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Apr 2022 16:13:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23325#M16076</guid>
      <dc:creator>JeromeB974</dc:creator>
      <dc:date>2022-04-28T16:13:29Z</dc:date>
    </item>
    <item>
      <title>Re: can we use spark-xml with delta live tables ?</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23327#M16078</link>
      <description>&lt;P&gt;This is a tough one since the only magic command available is %pip, but spark-xml is a maven package. The only way I found to do this was to install the spark-xml jar from the maven repo using the databricks-cli. You can reference the cluster ID using spark.conf.get("spark.databricks.clusterUsageTags.clusterId"), something not well documented in the databricks cli documentation. This is not secure/production ready, but is a good starting point.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Found this post last week and couldn't find a solution. So here is my submission &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;@dlt.table(
  name="xmldata",
  comment="Some XML Data")
def dlt_xmldata():    
&amp;nbsp;
    host = ""
    token = ""
    clusterid = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    path = ""
    rowTag=""
&amp;nbsp;
    import subprocess
&amp;nbsp;
    pysh = """
    pip install databricks-cli
    rm ~/.databrickscfg
    touch ~/.databrickscfg
    echo "[DEFAULT]" &amp;gt;&amp;gt; ~/.databrickscfg
    echo "host = {1}" &amp;gt;&amp;gt; ~/.databrickscfg
    echo "token = {2}" &amp;gt;&amp;gt; ~/.databrickscfg
    export DATABRICKS_CONFIG_FILE=~/.databrickscfg
&amp;nbsp;
    databricks libraries install --cluster-id {0} --maven-coordinates "com.databricks:spark-xml_2.12:0.14.0"
    databricks libraries list --cluster-id {0}
    """
    
    subprocess.run(pysh.format(clusterid,host,token),
        shell=True, check=True,
        executable='/bin/bash')
&amp;nbsp;
    return spark.read.format("xml").option("rowTag",rowTag).option("nullValue","").load(path)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 16 May 2022 17:48:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23327#M16078</guid>
      <dc:creator>Zachary_Higgins</dc:creator>
      <dc:date>2022-05-16T17:48:18Z</dc:date>
    </item>
    <item>
      <title>Re: can we use spark-xml with delta live tables ?</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23328#M16079</link>
      <description>&lt;P&gt;Also need to give credit where credit is due regarding the idea to setup databricks-cli from the notebook: &lt;A href="https://stackoverflow.com/questions/55837773/how-to-fix-command-not-found-error-in-databricks-when-creating-a-secret-scope" alt="https://stackoverflow.com/questions/55837773/how-to-fix-command-not-found-error-in-databricks-when-creating-a-secret-scope" target="_blank"&gt;How to fix 'command not found' error in Databricks when creating a secret scope - Stack Overflow&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 16 May 2022 18:57:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23328#M16079</guid>
      <dc:creator>Zachary_Higgins</dc:creator>
      <dc:date>2022-05-16T18:57:35Z</dc:date>
    </item>
    <item>
      <title>Re: can we use spark-xml with delta live tables ?</title>
      <link>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23330#M16081</link>
      <description>&lt;P&gt;Just following up. My submission is a bad solution and shouldn't be implemented. This broke the moment we used %pip to install additional libraries.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I sent our wishes to the Databricks reps we work with, but at this time there doesn't seem to be a good way to support XML. In our case, we added a workflow task (scheduled job) to load these XML documents into a delta table, and work the delta tables as one of the sources in our DLT pipeline.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Jun 2022 20:42:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-we-use-spark-xml-with-delta-live-tables/m-p/23330#M16081</guid>
      <dc:creator>Zachary_Higgins</dc:creator>
      <dc:date>2022-06-01T20:42:44Z</dc:date>
    </item>
  </channel>
</rss>

