I want to ingest multiple XML files with varying but similar structures without defining a schema.
For example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<LIEFERUNG>
<ABSENDER>
<RZLZ>R00000001</RZLZ>
<NAME>Informatik GmbH </NAME>
<STRASSE>Muster-Allee 90</STRASSE>
<PLZ>60486</PLZ>
<ORT>Frankfurt a.M.</ORT>
<LAND>DE</LAND>
<KONTAKT>
<ANREDE>Herr</ANREDE>
<VORNAME>Max</VORNAME>
<ZUNAME>Mustermann</ZUNAME>
<TELEFON>xxxxx/xxxx-xx</TELEFON>
<FAX>xxx/xxxxx-xx</FAX>
<EMAIL>max.mustermann@informatik.de</EMAIL>
</KONTAKT>
</ABSENDER>
<ERSTELLER>
<BLZ>0000000000000</BLZ>
<NAME>MUSTERBANK</NAME>
<STRASSE>MUSTER STR. 1</STRASSE>
<PLZ>12345</PLZ>
<ORT>Musterhausen</ORT>
<LAND>DE</LAND>
<KONTAKT>
<ANREDE>Frau</ANREDE>
<VORNAME>Maxime</VORNAME>
<ZUNAME>Musterdame</ZUNAME>
<ABTEILUNG>Contolling</ABTEILUNG>
<TELEFON>xxxxx xx-xxxxx</TELEFON>
<FAX>xxxxx xx-xxxxx</FAX>
<EMAIL>maxime.musterdame@aol.de</EMAIL>
</KONTAKT>
</ERSTELLER>
<MELDUNG erstellzeit="2020-09-02T11:23:33">
.....
</MELDUNG>
</LIEFERUNG>
I attempted to load the XML within DLT pipelines using the following code:
%sql
CREATE OR REFRESH STREAMING TABLE conformed
COMMENT "BRONZE"
AS
SELECT
*
FROM
cloud_files(
"<file-path>"
"xml",
map(
"rowTag",
"LIEFERUNG",
"mode",
"FAILFAST",
"inferSchema",
"true",
"encoding",
"ISO-8859-1",
"ignoreNamespace",
"false",
"timestampFormat",
"yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]",
"timestampNTZFormat",
"yyyy-MM-dd'T'HH:mm:ss[.SSS]",
"dateFormat",
"yyyy-MM-dd",
"locale",
"de-DE",
"readerCaseSensitive",
"true"
)
) as src
The result is a table with 3 columns:
Column Type
ABSENDER string
ERSTELLER string
MELDUNG string
Within those columns is XML Code, e.g.
ABSENDER:
<RZLZ>R00000001</RZLZ>
<NAME>Informatik GmbH </NAME>
<STRASSE>Muster-Allee 90</STRASSE>
<PLZ>60486</PLZ>
<ORT>Frankfurt a.M.</ORT>
<LAND>DE</LAND>
<KONTAKT>
<ANREDE>Herr</ANREDE>
<VORNAME>Max</VORNAME>
<ZUNAME>Mustermann</ZUNAME>
<TELEFON>xxxxx/xxxx-xx</TELEFON>
<FAX>xxx/xxxxx-xx</FAX>
<EMAIL>max.mustermann@informatik.de</EMAIL>
</KONTAKT>
Now the questions:
- Is it best practice for BRONZE to have xml stored like this, as it will have the least "issues" with schema changes, as the schema will just change within these tags?
- Nevertheless, how can I make it within BRONZE DLT Step with SQL to explode the xml without defining the schema?
- And how would I explode the XML within SILVER without defining the schema?
I already tried with those code:
SELECT
from_xml(CONCAT('<ABSENDER>', ABSENDER, '</ABSENDER>'),
schema_of_xml(CONCAT('<ABSENDER>', '<RZLZ>R00000001</RZLZ>
<NAME>Informatik GmbH </NAME>
<STRASSE>Muster-Allee 90</STRASSE>
<PLZ>60486</PLZ>
<ORT>Frankfurt a.M.</ORT>
<LAND>DE</LAND>
<KONTAKT>
<ANREDE>Herr</ANREDE>
<VORNAME>Max</VORNAME>
<ZUNAME>Mustermann</ZUNAME>
<TELEFON>xxxxx/xxxx-xx</TELEFON>
<FAX>xxx/xxxxx-xx</FAX>
<EMAIL>max.mustermann@informatik.de</EMAIL>
</KONTAKT>', '</ABSENDER>'))) AS ABSENDER_XML
FROM conformed
But it gave me just a bad result with most of the attributes not mapped.
object
KONTAKT: null
LAND: null
NAME: null
ORT: null
PLZ: null
RZLZ: "R00000001"
STRASSE: null