Hi, looking for the right solution pattern for this scenario:
We have millions of relatively small XML files (currently sitting in ADLS) that we have to load into Delta Lake. Each XML file has to be read, parsed, and pivoted before being written to a Delta table. The XML schemas differ between files and drift over time. There are no dependencies between the files, and every record can simply be appended to the table (i.e., no merging/upserts).
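To make the per-file work concrete, it's conceptually along these lines (a heavily simplified sketch; the element/attribute names, pivot logic, and table name are placeholders, not the real ones, and `spark` is the notebook's SparkSession):

```python
import xml.etree.ElementTree as ET
from pyspark.sql import Row

def parse_and_pivot(xml_text: str) -> list[dict]:
    """Parse one small XML document and pivot its repeated child elements
    into a single wide row. Element/attribute names here are placeholders."""
    root = ET.fromstring(xml_text)
    row = {"record_id": root.attrib.get("id")}
    for metric in root.findall(".//metric"):
        # pivot: each metric name becomes its own column
        row[metric.attrib["name"]] = metric.attrib.get("value")
    return [row]

# Tiny inline sample, just to show the shape of the transformation.
sample = '<record id="42"><metric name="temp" value="21.5"/><metric name="rpm" value="900"/></record>'
rows = [Row(**r) for r in parse_and_pivot(sample)]

# Append to the Delta table, letting Delta merge the drifting schema on write.
(spark.createDataFrame(rows)
      .write.format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .saveAsTable("landing.xml_pivoted"))   # placeholder table name
```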
I have an implementation for this, but it's ridiculously slow. If this were a scenario with a small number of extremely large XML files, I could see a simple way for Databricks to handle it by parallelizing the work across partitions/workers, but here the files are tiny, so there's nothing to partition within a file. I've tried parallelizing with Python's ThreadPoolExecutor (concurrent.futures); it made a difference, but not a material one. I also tried pyspark's parallelize() to distribute the file paths as an RDD and apply a parsing function (UDF) to each one, but things got ugly. I'm probably not thinking about this the right way in terms of the architectural pattern that fits this use case.
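For reference, the two parallel attempts looked roughly like this (simplified sketches, not my actual code; I'm assuming the thread pool runs on the driver, and the paths, element names, worker/slice counts, and table name are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET
from pyspark.sql import Row

def parse_file(path: str) -> list[dict]:
    """Read one XML file and pivot it into row dicts.
    Reads via the /dbfs FUSE path; element names are placeholders."""
    with open(path) as f:
        root = ET.fromstring(f.read())
    row = {"record_id": root.attrib.get("id")}
    for metric in root.findall(".//metric"):
        row[metric.attrib["name"]] = metric.attrib.get("value")
    return [row]

# Placeholder listing; in reality this is millions of paths from ADLS.
paths = ["/dbfs/mnt/landing/xml/a.xml", "/dbfs/mnt/landing/xml/b.xml"]

# Attempt 1: thread pool over the file list on the driver.
# This helped, but not materially.
with ThreadPoolExecutor(max_workers=32) as pool:
    parsed = [row for rows in pool.map(parse_file, paths) for row in rows]

# Attempt 2: distribute the path list as an RDD and parse on the workers
# (this is the variant that got ugly for me).
rows_rdd = spark.sparkContext.parallelize(paths, numSlices=1024).flatMap(parse_file)
df = spark.createDataFrame(rows_rdd.map(lambda r: Row(**r)))
(df.write.format("delta")
   .mode("append")
   .option("mergeSchema", "true")
   .saveAsTable("landing.xml_pivoted"))
```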