cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

XML DLT Autoloader - Ingestion of XML Files

avrm91
Contributor

I want to ingest multiple XML files with varying but similar structures without defining a schema.

For example: 

 

 

<?xml version="1.0" encoding="ISO-8859-1"?>
<LIEFERUNG>
  <ABSENDER>
    <RZLZ>R00000001</RZLZ>
    <NAME>Informatik GmbH </NAME>
    <STRASSE>Muster-Allee 90</STRASSE>
    <PLZ>60486</PLZ>
    <ORT>Frankfurt a.M.</ORT>
    <LAND>DE</LAND>
    <KONTAKT>
      <ANREDE>Herr</ANREDE>
      <VORNAME>Max</VORNAME>
      <ZUNAME>Mustermann</ZUNAME>
      <TELEFON>xxxxx/xxxx-xx</TELEFON>
      <FAX>xxx/xxxxx-xx</FAX>
      <EMAIL>max.mustermann@informatik.de</EMAIL>
    </KONTAKT>
  </ABSENDER>
  <ERSTELLER>
    <BLZ>0000000000000</BLZ>
    <NAME>MUSTERBANK</NAME>
    <STRASSE>MUSTER STR. 1</STRASSE>
    <PLZ>12345</PLZ>
    <ORT>Musterhausen</ORT>
    <LAND>DE</LAND>
    <KONTAKT>
      <ANREDE>Frau</ANREDE>
      <VORNAME>Maxime</VORNAME>
      <ZUNAME>Musterdame</ZUNAME>
      <ABTEILUNG>Contolling</ABTEILUNG>
      <TELEFON>xxxxx xx-xxxxx</TELEFON>
      <FAX>xxxxx xx-xxxxx</FAX>
      <EMAIL>maxime.musterdame@aol.de</EMAIL>
    </KONTAKT>
  </ERSTELLER>
  <MELDUNG erstellzeit="2020-09-02T11:23:33">
    ..... 
  </MELDUNG>
</LIEFERUNG>

 

 

I attempted to load the XML within DLT pipelines using the following code:

 

 

%sql
CREATE OR REFRESH STREAMING TABLE conformed
COMMENT "BRONZE" 
AS
SELECT
  *
FROM
  cloud_files(
    "<file-path>"
    "xml",
    map(
      "rowTag",
      "LIEFERUNG",
      "mode",
      "FAILFAST",
      "inferSchema",
      "true",
      "encoding",
      "ISO-8859-1",
      "ignoreNamespace",
      "false",
      "timestampFormat",
      "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]",
      "timestampNTZFormat",
      "yyyy-MM-dd'T'HH:mm:ss[.SSS]",
      "dateFormat",
      "yyyy-MM-dd",
      "locale",
      "de-DE",
      "readerCaseSensitive",
      "true"
    )
  ) as src

 

 

The result is a table with 3 columns:

Column          Type
ABSENDER   
string
ERSTELLER  string
MELDUNG     string
Within those columns is XML Code, e.g.
ABSENDER:

 

 

<RZLZ>R00000001</RZLZ>
    <NAME>Informatik GmbH </NAME>
    <STRASSE>Muster-Allee 90</STRASSE>
    <PLZ>60486</PLZ>
    <ORT>Frankfurt a.M.</ORT>
    <LAND>DE</LAND>
    <KONTAKT>
      <ANREDE>Herr</ANREDE>
      <VORNAME>Max</VORNAME>
      <ZUNAME>Mustermann</ZUNAME>
      <TELEFON>xxxxx/xxxx-xx</TELEFON>
      <FAX>xxx/xxxxx-xx</FAX>
      <EMAIL>max.mustermann@informatik.de</EMAIL>
    </KONTAKT>

 

 

Now the questions:

  • Is it best practice for BRONZE to have xml stored like this, as it will have the least "issues" with schema changes, as the schema will just change within these tags?
  • Nevertheless, how can I make it within BRONZE DLT Step with SQL to explode the xml without defining the schema?
  • And how would I explode the XML within SILVER without defining the schema?

I already tried with those code:  

 

 

SELECT
  from_xml(CONCAT('<ABSENDER>', ABSENDER, '</ABSENDER>'), 
    schema_of_xml(CONCAT('<ABSENDER>', '<RZLZ>R00000001</RZLZ>
      <NAME>Informatik GmbH </NAME>
      <STRASSE>Muster-Allee 90</STRASSE>
      <PLZ>60486</PLZ>
      <ORT>Frankfurt a.M.</ORT>
      <LAND>DE</LAND>
      <KONTAKT>
        <ANREDE>Herr</ANREDE>
        <VORNAME>Max</VORNAME>
        <ZUNAME>Mustermann</ZUNAME>
        <TELEFON>xxxxx/xxxx-xx</TELEFON>
        <FAX>xxx/xxxxx-xx</FAX>
        <EMAIL>max.mustermann@informatik.de</EMAIL>
      </KONTAKT>', '</ABSENDER>')))  AS ABSENDER_XML
FROM conformed

 

 

But it gave me just a bad result with most of the attributes not mapped. 

 

 

object
KONTAKT: null
LAND: null
NAME: null
ORT: null
PLZ: null
RZLZ: "R00000001"
STRASSE: null

 

 

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz_Fatma
Community Manager
Community Manager

Hi @avrm91

  • Storing XML data as-is (without a predefined schema) in the BRONZE layer can be a reasonable approach, especially if the schema changes frequently within specific tags. It allows you to capture all the data without worrying about schema modifications.
  • However, keep in mind that querying and processing such unstructured data can be complex. You wonโ€™t benefit from schema-based optimizations, and youโ€™ll need to handle schema evolution carefully during data transformations.
  • To explode the XML data within the BRONZE layer without defining a schema, you can use the from_xml function. However, since your XML structure varies, itโ€™s challenging to map all attributes directly.
  • Instead of using schema_of_xml, consider using a more flexible approach. You can extract the relevant parts of the XML into a JSON-like structure (a map or struct) and then work with that.
  • In the SILVER layer, you can follow a similar approach. Again, use from_xml to extract the relevant parts of the XML.
  • Since youโ€™re dealing with varying structures, consider creating a generic transformation that handles different cases dynamically. For example, you can extract all child elements of ABSENDER into a struct or map.
  • To map attributes correctly, youโ€™ll need to manually extract them from the JSON-like structure (map or struct) and create separate columns.

View solution in original post

3 REPLIES 3

Kaniz_Fatma
Community Manager
Community Manager

Hi @avrm91

  • Storing XML data as-is (without a predefined schema) in the BRONZE layer can be a reasonable approach, especially if the schema changes frequently within specific tags. It allows you to capture all the data without worrying about schema modifications.
  • However, keep in mind that querying and processing such unstructured data can be complex. You wonโ€™t benefit from schema-based optimizations, and youโ€™ll need to handle schema evolution carefully during data transformations.
  • To explode the XML data within the BRONZE layer without defining a schema, you can use the from_xml function. However, since your XML structure varies, itโ€™s challenging to map all attributes directly.
  • Instead of using schema_of_xml, consider using a more flexible approach. You can extract the relevant parts of the XML into a JSON-like structure (a map or struct) and then work with that.
  • In the SILVER layer, you can follow a similar approach. Again, use from_xml to extract the relevant parts of the XML.
  • Since youโ€™re dealing with varying structures, consider creating a generic transformation that handles different cases dynamically. For example, you can extract all child elements of ABSENDER into a struct or map.
  • To map attributes correctly, youโ€™ll need to manually extract them from the JSON-like structure (map or struct) and create separate columns.

avrm91
Contributor

@Kaniz_Fatma Thanks a lot.
I found an issue in from_xml function.
I posted above:

 

SELECT
  from_xml(CONCAT('<ABSENDER>', ABSENDER, '</ABSENDER>'), 
    schema_of_xml('
		<ABSENDER>
			<RZLZ>R00000001</RZLZ>
			<NAME>Informatik GmbH</NAME>
			<STRASSE>Muster-Allee 90</STRASSE>
			<PLZ>60486</PLZ>
			<ORT>Frankfurt a.M.</ORT>
			<LAND>DE</LAND>
			<KONTAKT>
				<ANREDE>Herr</ANREDE>
				<VORNAME>Max</VORNAME>
				<ZUNAME>Mustermann</ZUNAME>
				<TELEFON>xxxxx/xxxx-xx</TELEFON>
				<FAX>xxx/xxxxx-xx</FAX>
				<EMAIL>max.mustermann@informatik.de</EMAIL>
			</KONTAKT>
		</ABSENDER>'))  AS ABSENDER_XML
FROM conformed

 

But it gave me just a bad result with most of the attributes not mapped. 

 

object
KONTAKT: null
LAND: null
NAME: null
ORT: null
PLZ: null
RZLZ: "R00000001"
STRASSE: null

 

This is not the real data as I "anonymized" the data before posting here.
So, the reason is that the 
<NAME>Informatik GmbH & Co.KG</NAME> contains an "&" what will make the from_xml function fail. 
I used a simple replace to "fix" it (make it work for now)

 

%sql
replace(ABSENDER, '&', '')

 

Now the object looks like that.

 

object
KONTAKT: 
	ANREDE: "Herr"
	EMAIL: "max.mustermann@informatik.de"
	FAX: "xxx/xxxxx-xx"
	TELEFON: "xxxxx/xxxx-xx"
	VORNAME: "Max"
	ZUNAME: "Mustermann"
LAND: "DE"
NAME: "Informatik GmbH  Co.KG"
ORT: "Frankfurt a.M."
PLZ: "60486"
RZLZ: "R00000001"
STRASSE: "Muster-Allee 90"

 

I found it after several hours of debugging. Sad it gave me no error. It just did not map the object correctly.
Hope you can address this, as in the documentation it says, this feature is in public preview
from_xml function - Azure Databricks - Databricks SQL | Microsoft Learn

The Assistant recommended 

%sql
replace(ABSENDER, '&', '&amp;')

the output stays:

NAME: "Informatik GmbH & Co.KG"

 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group