topic Re: XML DLT Autoloader - Ingestion of XML Files in Data Engineering

XML DLT Autoloader - Ingestion of XML Files

avrm91 — Fri, 31 May 2024 14:01:43 GMT

I want to ingest multiple XML files with varying but similar structures without defining a schema.

For example:

<?xml version="1.0" encoding="ISO-8859-1"?> <LIEFERUNG> <ABSENDER> <RZLZ>R00000001</RZLZ> <NAME>Informatik GmbH </NAME> <STRASSE>Muster-Allee 90</STRASSE> <PLZ>60486</PLZ> <ORT>Frankfurt a.M.</ORT> <LAND>DE</LAND> <KONTAKT> <ANREDE>Herr</ANREDE> <VORNAME>Max</VORNAME> <ZUNAME>Mustermann</ZUNAME> <TELEFON>xxxxx/xxxx-xx</TELEFON> <FAX>xxx/xxxxx-xx</FAX> <EMAIL>max.mustermann@informatik.de</EMAIL> </KONTAKT> </ABSENDER> <ERSTELLER> <BLZ>0000000000000</BLZ> <NAME>MUSTERBANK</NAME> <STRASSE>MUSTER STR. 1</STRASSE> <PLZ>12345</PLZ> <ORT>Musterhausen</ORT> <LAND>DE</LAND> <KONTAKT> <ANREDE>Frau</ANREDE> <VORNAME>Maxime</VORNAME> <ZUNAME>Musterdame</ZUNAME> <ABTEILUNG>Contolling</ABTEILUNG> <TELEFON>xxxxx xx-xxxxx</TELEFON> <FAX>xxxxx xx-xxxxx</FAX> <EMAIL>maxime.musterdame@aol.de</EMAIL> </KONTAKT> </ERSTELLER> <MELDUNG erstellzeit="2020-09-02T11:23:33"> ..... </MELDUNG> </LIEFERUNG>

I attempted to load the XML within DLT pipelines using the following code:

%sql CREATE OR REFRESH STREAMING TABLE conformed COMMENT "BRONZE" AS SELECT * FROM cloud_files( "<file-path>" "xml", map( "rowTag", "LIEFERUNG", "mode", "FAILFAST", "inferSchema", "true", "encoding", "ISO-8859-1", "ignoreNamespace", "false", "timestampFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]", "timestampNTZFormat", "yyyy-MM-dd'T'HH:mm:ss[.SSS]", "dateFormat", "yyyy-MM-dd", "locale", "de-DE", "readerCaseSensitive", "true" ) ) as src

The result is a table with 3 columns:

Column Type
ABSENDER string

ERSTELLER string

MELDUNG string

Within those columns is XML Code, e.g.

ABSENDER:

<RZLZ>R00000001</RZLZ> <NAME>Informatik GmbH </NAME> <STRASSE>Muster-Allee 90</STRASSE> <PLZ>60486</PLZ> <ORT>Frankfurt a.M.</ORT> <LAND>DE</LAND> <KONTAKT> <ANREDE>Herr</ANREDE> <VORNAME>Max</VORNAME> <ZUNAME>Mustermann</ZUNAME> <TELEFON>xxxxx/xxxx-xx</TELEFON> <FAX>xxx/xxxxx-xx</FAX> <EMAIL>max.mustermann@informatik.de</EMAIL> </KONTAKT>

Now the questions:

Is it best practice for BRONZE to have xml stored like this, as it will have the least "issues" with schema changes, as the schema will just change within these tags?
Nevertheless, how can I make it within BRONZE DLT Step with SQL to explode the xml without defining the schema?
And how would I explode the XML within SILVER without defining the schema?

I already tried with those code:

SELECT from_xml(CONCAT('<ABSENDER>', ABSENDER, '</ABSENDER>'), schema_of_xml(CONCAT('<ABSENDER>', '<RZLZ>R00000001</RZLZ> <NAME>Informatik GmbH </NAME> <STRASSE>Muster-Allee 90</STRASSE> <PLZ>60486</PLZ> <ORT>Frankfurt a.M.</ORT> <LAND>DE</LAND> <KONTAKT> <ANREDE>Herr</ANREDE> <VORNAME>Max</VORNAME> <ZUNAME>Mustermann</ZUNAME> <TELEFON>xxxxx/xxxx-xx</TELEFON> <FAX>xxx/xxxxx-xx</FAX> <EMAIL>max.mustermann@informatik.de</EMAIL> </KONTAKT>', '</ABSENDER>'))) AS ABSENDER_XML FROM conformed

But it gave me just a bad result with most of the attributes not mapped.

object KONTAKT: null LAND: null NAME: null ORT: null PLZ: null RZLZ: "R00000001" STRASSE: null

Re: XML DLT Autoloader - Ingestion of XML Files

avrm91 — Mon, 03 Jun 2024 14:29:35 GMT

@Retired_mod Thanks a lot.
I found an issue in from_xml function.
I posted above:

SELECT from_xml(CONCAT('<ABSENDER>', ABSENDER, '</ABSENDER>'), schema_of_xml(' <ABSENDER> <RZLZ>R00000001</RZLZ> <NAME>Informatik GmbH</NAME> <STRASSE>Muster-Allee 90</STRASSE> <PLZ>60486</PLZ> <ORT>Frankfurt a.M.</ORT> <LAND>DE</LAND> <KONTAKT> <ANREDE>Herr</ANREDE> <VORNAME>Max</VORNAME> <ZUNAME>Mustermann</ZUNAME> <TELEFON>xxxxx/xxxx-xx</TELEFON> <FAX>xxx/xxxxx-xx</FAX> <EMAIL>max.mustermann@informatik.de</EMAIL> </KONTAKT> </ABSENDER>')) AS ABSENDER_XML FROM conformed

But it gave me just a bad result with most of the attributes not mapped.

object KONTAKT: null LAND: null NAME: null ORT: null PLZ: null RZLZ: "R00000001" STRASSE: null

This is not the real data as I "anonymized" the data before posting here.
So, the reason is that the <NAME>Informatik GmbH & Co.KG</NAME> contains an "&" what will make the from_xml function fail.
I used a simple replace to "fix" it (make it work for now)

%sql replace(ABSENDER, '&', '')

Now the object looks like that.

object KONTAKT: ANREDE: "Herr" EMAIL: "max.mustermann@informatik.de" FAX: "xxx/xxxxx-xx" TELEFON: "xxxxx/xxxx-xx" VORNAME: "Max" ZUNAME: "Mustermann" LAND: "DE" NAME: "Informatik GmbH Co.KG" ORT: "Frankfurt a.M." PLZ: "60486" RZLZ: "R00000001" STRASSE: "Muster-Allee 90"

I found it after several hours of debugging. Sad it gave me no error. It just did not map the object correctly.
Hope you can address this, as in the documentation it says, this feature is in public preview
from_xml function - Azure Databricks - Databricks SQL | Microsoft Learn

Re: XML DLT Autoloader - Ingestion of XML Files

avrm91 — Mon, 03 Jun 2024 19:10:51 GMT

The Assistant recommended

%sql replace(ABSENDER, '&', '&')

the output stays:

NAME: "Informatik GmbH & Co.KG"