Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Need pattern for loading a million small XML files

CDICSteph
New Contributor

Hi, looking for the right solution pattern for this scenario:

We have millions of relatively small XML files (currently sitting in ADLS) that we have to load into Delta Lake. Each XML file has to be read, parsed, and pivoted before being written to a Delta table. The XML schemas can differ and drift over time. There are no dependencies between the files, and all of them can be appended wholesale to the table (i.e., without merging).

I have an implementation for this, but it's ridiculously slow. If this were a scenario with a small number of extremely large XML files, I could see a simple way for Databricks to handle it by parallelizing the work over partitions/workers, but in this case the files are tiny, so partitioning individual files isn't an option. I've tried parallelizing with the ThreadPoolExecutor library; it made a difference, but not a material one. I also tried PySpark's parallelize() to apply a UDF to each file distributed in an RDD, but things got ugly. I'm probably not thinking about this the right way in terms of the architectural pattern that fits this use case.
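For reference, a minimal sketch of the "simple" distributed read alluded to above, assuming the Spark XML reader is available (the spark-xml package, or the built-in "xml" source on newer runtimes); the rowTag value, paths, and table name are placeholders:

# Single batch read over the whole directory; Spark packs many small files
# into each partition, so the work is distributed without per-file threading.
df = (
    spark.read.format("xml")
    .option("rowTag", "record")   # placeholder: root element of each file
    .load("abfss://container@account.dfs.core.windows.net/path/to/xml/")
)

# Append everything to a Delta table (no merge needed in this scenario).
df.write.format("delta").mode("append").saveAsTable("main.ingest.xml_events")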

2 REPLIES

jose_gonzalez
Databricks Employee

You can use Auto Loader for this. This KB article walks through it step by step: https://kb.databricks.com/streaming/stream-xml-auto-loader
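A minimal sketch of that approach, assuming a recent Databricks Runtime where Auto Loader supports XML natively (cloudFiles.format = "xml"); on older runtimes the linked KB instead reads the files as binaryFile and parses them with a UDF. The rowTag value, paths, checkpoint/schema locations, and table name below are placeholders:

# Incrementally discover and parse the XML files with Auto Loader.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "xml")
    .option("rowTag", "record")                                  # placeholder root element
    .option("cloudFiles.schemaLocation", "abfss://container@account.dfs.core.windows.net/_schemas/xml_ingest")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")   # tolerate schema drift
    .load("abfss://container@account.dfs.core.windows.net/path/to/xml/")
)

# Append to a Delta table, letting the table schema evolve as new columns appear.
(
    df.writeStream
    .option("checkpointLocation", "abfss://container@account.dfs.core.windows.net/_checkpoints/xml_ingest")
    .option("mergeSchema", "true")
    .trigger(availableNow=True)       # drain the existing backlog, then stop
    .toTable("main.ingest.xml_events")
)

Auto Loader's file discovery is built for very large numbers of small files, and the checkpoint ensures files that have already been ingested are not reprocessed on later runs.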

Anonymous
Not applicable

Hi @Steph Swierenga,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation, and let us know if you need any further assistance!
