04-07-2022 11:29 PM
Hi,
Is there a way to use spark-xml with Delta Live Tables (Azure Databricks)?
I've tried something like this, without any success so far:
CREATE LIVE TABLE df17
USING com.databricks.spark.xml
AS SELECT * FROM cloud_files("/mnt/dev/bronze/xml/s4327994", "xml")
Can we load this library with DLT?
04-11-2022 01:38 PM
@Jerome BASTIDE, custom implementations are more straightforward in Python. You can read whatever you need; just return a DataFrame.
Auto Loader doesn't support XML, so you need to load the XML the traditional way:
import dlt

@dlt.view
def dlt_dev_bronze():
    # requires the spark-xml library to be available on the pipeline cluster
    return spark.read.format("xml").option("rowTag", "tag").load("dbfs:/mnt/dev/bronze/xml/s4327994")
04-26-2022 04:13 AM
Hi @Jerome BASTIDE, just a friendly follow-up. Do you still need help, or did @Hubert Dudek's response help you find the solution? Please let us know.
04-28-2022 09:13 AM
Hi,
No, I didn't manage to make it work, neither in SQL nor in Python.
It seems to require spark-xml, and I didn't find a way to use it with Delta Live Tables.
I will try Auto Loader in binary mode.
Regards.
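For reference, a minimal sketch of what Auto Loader in binary mode could look like inside a DLT pipeline (the table name and path below are placeholders; the raw XML bytes in the content column would still need to be parsed separately, which is the open issue here):
import dlt

@dlt.table(name="xml_raw_binary", comment="Raw XML files ingested as binary with Auto Loader")
def xml_raw_binary():
    # Auto Loader can ingest the files as binaryFile even though it cannot parse XML;
    # each row carries the file path plus the raw bytes in the `content` column.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load("/mnt/dev/bronze/xml/s4327994")
    )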
05-10-2022 05:10 AM
Hi @Jerome BASTIDE, just checking in. Did you try Auto Loader in binary mode?
05-16-2022 10:48 AM
This is a tough one, since the only magic command available is %pip, but spark-xml is a Maven package. The only way I found to do this was to install the spark-xml JAR from the Maven repo using the databricks-cli. You can reference the cluster ID using spark.conf.get("spark.databricks.clusterUsageTags.clusterId"), which is not well documented in the databricks-cli documentation. This is not secure/production-ready, but it is a good starting point.
I found this post last week and couldn't find a solution, so here is my submission:
import subprocess

import dlt

@dlt.table(
    name="xmldata",
    comment="Some XML Data")
def dlt_xmldata():
    host = ""
    token = ""
    clusterid = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
    path = ""
    rowTag = ""
    # Install the spark-xml Maven package on this pipeline's cluster via the databricks-cli.
    pysh = """
    pip install databricks-cli
    rm -f ~/.databrickscfg
    touch ~/.databrickscfg
    echo "[DEFAULT]" >> ~/.databrickscfg
    echo "host = {1}" >> ~/.databrickscfg
    echo "token = {2}" >> ~/.databrickscfg
    export DATABRICKS_CONFIG_FILE=~/.databrickscfg
    databricks libraries install --cluster-id {0} --maven-coordinates "com.databricks:spark-xml_2.12:0.14.0"
    databricks libraries list --cluster-id {0}
    """
    subprocess.run(pysh.format(clusterid, host, token),
                   shell=True, check=True,
                   executable='/bin/bash')
    return spark.read.format("xml").option("rowTag", rowTag).option("nullValue", "").load(path)
05-16-2022 11:57 AM
Also, credit where credit is due for the idea of setting up the databricks-cli from the notebook: How to fix 'command not found' error in Databricks when creating a secret scope - Stack Overflow
05-17-2022 12:20 AM
Hi @Zachary Higgins, thank you for posting a fantastic explanation here in the community.
Hi @JeromeB974, just a friendly follow-up. Would you like to tell us whether this reply from @Zachary Higgins helped you? Please let us know.
06-01-2022 01:42 PM
Just following up: my submission is a bad solution and shouldn't be implemented. It broke the moment we used %pip to install additional libraries.
I passed our wishes along to the Databricks reps we work with, but at this time there doesn't seem to be a good way to support XML in DLT. In our case, we added a workflow task (scheduled job) to load these XML documents into a Delta table, and use those Delta tables as one of the sources in our DLT pipeline.
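For anyone following the same path, a minimal sketch of that workaround, assuming the paths, row tag, and table name below are placeholders: a separate scheduled job (on a cluster with spark-xml installed) stages the XML as Delta, and the DLT pipeline then reads that Delta output as a source.
# --- Scheduled job notebook (cluster with spark-xml installed): XML -> Delta ---
(spark.read.format("xml")
    .option("rowTag", "record")          # placeholder row tag
    .load("/mnt/dev/bronze/xml/s4327994")
    .write.format("delta")
    .mode("overwrite")
    .save("/mnt/dev/bronze/delta/xmldata"))

# --- DLT pipeline: consume the staged Delta data as a source ---
import dlt

@dlt.table(name="xmldata_bronze", comment="XML staged to Delta by a separate job")
def xmldata_bronze():
    return spark.read.format("delta").load("/mnt/dev/bronze/delta/xmldata")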