
Need pattern for loading a million small XML files

CDICSteph
New Contributor

Hi, looking for the right solution pattern for this scenario:

We have millions of relatively small XML files (currently sitting in ADLS) that we have to load into Delta Lake. Each XML file has to be read, parsed, and pivoted before being written to a Delta table. The XML schemas can differ and drift over time. There are no dependencies between the files, and all can be appended wholesale to the table (i.e., without merging).

I have an implementation for this, but it’s ridiculously slow. If this were a small number of extremely large XML files, I could see a simple way for Databricks to handle it by parallelizing the work over partitions/workers. In this case the files are relatively tiny, so partitioning isn’t a thing. I’ve tried parallelizing with ThreadPoolExecutor; it made a difference, but not a material enough one. I also tried PySpark’s parallelize() to apply a UDF to each file distributed in an RDD, but things got ugly. I’m probably not thinking about this the right way in terms of the architectural pattern that fits my use case.

2 REPLIES

jose_gonzalez
Moderator

You can use Auto Loader for this. Please check this sample KB article, which walks through how to do it step by step: https://kb.databricks.com/streaming/stream-xml-auto-loader
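
For a rough idea of what that can look like in PySpark, here is a minimal sketch. It is not the code from the KB article. It assumes a Databricks runtime where Auto Loader can read XML natively (`cloudFiles.format` set to `xml` with a `rowTag` option); on older runtimes the spark-xml library would be needed instead. The function names, paths, table name, and row tag are all hypothetical placeholders, and the pivot step from the question is left out.

```python
def autoloader_xml_options(schema_location: str, row_tag: str) -> dict:
    """Build Auto Loader reader options for XML input (hypothetical helper)."""
    return {
        # Assumes native XML support in Auto Loader on the cluster's runtime.
        "cloudFiles.format": "xml",
        # XML element that maps to one row; depends on your files.
        "rowTag": row_tag,
        # Where Auto Loader stores the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # Pick up new columns as the XML schemas drift.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }


def start_xml_ingest(spark, source_path, target_table, checkpoint_path, row_tag):
    """Incrementally append newly arrived XML files to a Delta table.

    `spark` is the SparkSession that a Databricks notebook provides.
    """
    options = autoloader_xml_options(checkpoint_path, row_tag)
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        # Let the Delta table accept columns added by schema evolution.
        .option("mergeSchema", "true")
        # Process everything currently available, then stop (batch-style run).
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

In a notebook this could be called as `start_xml_ingest(spark, "<ADLS path>", "<table name>", "<checkpoint path>", "<row tag>")`. Auto Loader tracks which files it has already ingested, so each run only picks up new files, and Spark spreads the file reads across the workers rather than relying on driver-side threads.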

Anonymous
Not applicable

Hi @Steph Swierenga

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
