05-31-2024 06:02 AM - edited 05-31-2024 07:01 AM
I want to ingest multiple XML files with varying but similar structures without defining a schema.
For example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<LIEFERUNG>
<ABSENDER>
<RZLZ>R00000001</RZLZ>
<NAME>Informatik GmbH </NAME>
<STRASSE>Muster-Allee 90</STRASSE>
<PLZ>60486</PLZ>
<ORT>Frankfurt a.M.</ORT>
<LAND>DE</LAND>
<KONTAKT>
<ANREDE>Herr</ANREDE>
<VORNAME>Max</VORNAME>
<ZUNAME>Mustermann</ZUNAME>
<TELEFON>xxxxx/xxxx-xx</TELEFON>
<FAX>xxx/xxxxx-xx</FAX>
<EMAIL>max.mustermann@informatik.de</EMAIL>
</KONTAKT>
</ABSENDER>
<ERSTELLER>
<BLZ>0000000000000</BLZ>
<NAME>MUSTERBANK</NAME>
<STRASSE>MUSTER STR. 1</STRASSE>
<PLZ>12345</PLZ>
<ORT>Musterhausen</ORT>
<LAND>DE</LAND>
<KONTAKT>
<ANREDE>Frau</ANREDE>
<VORNAME>Maxime</VORNAME>
<ZUNAME>Musterdame</ZUNAME>
<ABTEILUNG>Contolling</ABTEILUNG>
<TELEFON>xxxxx xx-xxxxx</TELEFON>
<FAX>xxxxx xx-xxxxx</FAX>
<EMAIL>maxime.musterdame@aol.de</EMAIL>
</KONTAKT>
</ERSTELLER>
<MELDUNG erstellzeit="2020-09-02T11:23:33">
.....
</MELDUNG>
</LIEFERUNG>
I attempted to load the XML within DLT pipelines using the following code:
%sql
CREATE OR REFRESH STREAMING TABLE conformed
COMMENT "BRONZE"
AS
SELECT
*
FROM
cloud_files(
"<file-path>"
"xml",
map(
"rowTag",
"LIEFERUNG",
"mode",
"FAILFAST",
"inferSchema",
"true",
"encoding",
"ISO-8859-1",
"ignoreNamespace",
"false",
"timestampFormat",
"yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]",
"timestampNTZFormat",
"yyyy-MM-dd'T'HH:mm:ss[.SSS]",
"dateFormat",
"yyyy-MM-dd",
"locale",
"de-DE",
"readerCaseSensitive",
"true"
)
) as src
The result is a table with 3 columns:
<RZLZ>R00000001</RZLZ>
<NAME>Informatik GmbH </NAME>
<STRASSE>Muster-Allee 90</STRASSE>
<PLZ>60486</PLZ>
<ORT>Frankfurt a.M.</ORT>
<LAND>DE</LAND>
<KONTAKT>
<ANREDE>Herr</ANREDE>
<VORNAME>Max</VORNAME>
<ZUNAME>Mustermann</ZUNAME>
<TELEFON>xxxxx/xxxx-xx</TELEFON>
<FAX>xxx/xxxxx-xx</FAX>
<EMAIL>max.mustermann@informatik.de</EMAIL>
</KONTAKT>
Now the questions:
I already tried with those code:
SELECT
from_xml(CONCAT('<ABSENDER>', ABSENDER, '</ABSENDER>'),
schema_of_xml(CONCAT('<ABSENDER>', '<RZLZ>R00000001</RZLZ>
<NAME>Informatik GmbH </NAME>
<STRASSE>Muster-Allee 90</STRASSE>
<PLZ>60486</PLZ>
<ORT>Frankfurt a.M.</ORT>
<LAND>DE</LAND>
<KONTAKT>
<ANREDE>Herr</ANREDE>
<VORNAME>Max</VORNAME>
<ZUNAME>Mustermann</ZUNAME>
<TELEFON>xxxxx/xxxx-xx</TELEFON>
<FAX>xxx/xxxxx-xx</FAX>
<EMAIL>max.mustermann@informatik.de</EMAIL>
</KONTAKT>', '</ABSENDER>'))) AS ABSENDER_XML
FROM conformed
But it gave me just a bad result with most of the attributes not mapped.
object
KONTAKT: null
LAND: null
NAME: null
ORT: null
PLZ: null
RZLZ: "R00000001"
STRASSE: null
06-03-2024 02:55 AM
Hi @avrm91,
from_xml
function. However, since your XML structure varies, it’s challenging to map all attributes directly.schema_of_xml
, consider using a more flexible approach. You can extract the relevant parts of the XML into a JSON-like structure (a map or struct) and then work with that.from_xml
to extract the relevant parts of the XML.06-03-2024 02:55 AM
Hi @avrm91,
from_xml
function. However, since your XML structure varies, it’s challenging to map all attributes directly.schema_of_xml
, consider using a more flexible approach. You can extract the relevant parts of the XML into a JSON-like structure (a map or struct) and then work with that.from_xml
to extract the relevant parts of the XML.06-03-2024 07:29 AM
@Kaniz_Fatma Thanks a lot.
I found an issue in from_xml function.
I posted above:
SELECT
from_xml(CONCAT('<ABSENDER>', ABSENDER, '</ABSENDER>'),
schema_of_xml('
<ABSENDER>
<RZLZ>R00000001</RZLZ>
<NAME>Informatik GmbH</NAME>
<STRASSE>Muster-Allee 90</STRASSE>
<PLZ>60486</PLZ>
<ORT>Frankfurt a.M.</ORT>
<LAND>DE</LAND>
<KONTAKT>
<ANREDE>Herr</ANREDE>
<VORNAME>Max</VORNAME>
<ZUNAME>Mustermann</ZUNAME>
<TELEFON>xxxxx/xxxx-xx</TELEFON>
<FAX>xxx/xxxxx-xx</FAX>
<EMAIL>max.mustermann@informatik.de</EMAIL>
</KONTAKT>
</ABSENDER>')) AS ABSENDER_XML
FROM conformed
But it gave me just a bad result with most of the attributes not mapped.
object
KONTAKT: null
LAND: null
NAME: null
ORT: null
PLZ: null
RZLZ: "R00000001"
STRASSE: null
This is not the real data as I "anonymized" the data before posting here.
So, the reason is that the <NAME>Informatik GmbH & Co.KG</NAME> contains an "&" what will make the from_xml function fail.
I used a simple replace to "fix" it (make it work for now)
%sql
replace(ABSENDER, '&', '')
Now the object looks like that.
object
KONTAKT:
ANREDE: "Herr"
EMAIL: "max.mustermann@informatik.de"
FAX: "xxx/xxxxx-xx"
TELEFON: "xxxxx/xxxx-xx"
VORNAME: "Max"
ZUNAME: "Mustermann"
LAND: "DE"
NAME: "Informatik GmbH Co.KG"
ORT: "Frankfurt a.M."
PLZ: "60486"
RZLZ: "R00000001"
STRASSE: "Muster-Allee 90"
I found it after several hours of debugging. Sad it gave me no error. It just did not map the object correctly.
Hope you can address this, as in the documentation it says, this feature is in public preview
from_xml function - Azure Databricks - Databricks SQL | Microsoft Learn
06-03-2024 12:10 PM
The Assistant recommended
%sql
replace(ABSENDER, '&', '&')
the output stays:
NAME: "Informatik GmbH & Co.KG"
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group