Hi, looking for the right solution pattern for this scenario:
We have millions of relatively small XML files (currently sitting in ADLS) that we have to load into Delta Lake. Each XML file has to be read, parsed, and pivoted before being written to a Delta table. The XML schemas differ between files and drift over time. There are no dependencies between the files, and every record can simply be appended to the table (i.e., no merging/upserts).
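To make the per-file work concrete, it's conceptually along these lines (a heavily simplified sketch; the element/attribute names, pivot logic, and table name are placeholders, not the real ones, and `spark` is the notebook's SparkSession):

```python
import xml.etree.ElementTree as ET
from pyspark.sql import Row

def parse_and_pivot(xml_text: str) -> list[dict]:
    """Parse one small XML document and pivot its repeated child elements
    into a single wide row. Element/attribute names here are placeholders."""
    root = ET.fromstring(xml_text)
    row = {"record_id": root.attrib.get("id")}
    for metric in root.findall(".//metric"):
        # pivot: each metric name becomes its own column
        row[metric.attrib["name"]] = metric.attrib.get("value")
    return [row]

# Tiny inline sample, just to show the shape of the transformation.
sample = '<record id="42"><metric name="temp" value="21.5"/><metric name="rpm" value="900"/></record>'
rows = [Row(**r) for r in parse_and_pivot(sample)]

# Append to the Delta table, letting Delta merge the drifting schema on write.
(spark.createDataFrame(rows)
      .write.format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .saveAsTable("landing.xml_pivoted"))   # placeholder table name
```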
I have an implementation for this, but it's ridiculously slow. If this were a scenario with a small number of extremely large XML files, I could see a simple way for Databricks to handle it by parallelizing the work across partitions/workers, but here the files are tiny, so there's nothing to partition within a file. I've tried parallelizing with Python's ThreadPoolExecutor (concurrent.futures); it made a difference, but not a material one. I also tried pyspark's parallelize() to distribute the file paths as an RDD and apply a parsing function (UDF) to each one, but things got ugly. I'm probably not thinking about this the right way in terms of the architectural pattern that fits this use case.
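For reference, the two parallel attempts looked roughly like this (simplified sketches, not my actual code; I'm assuming the thread pool runs on the driver, and the paths, element names, worker/slice counts, and table name are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET
from pyspark.sql import Row

def parse_file(path: str) -> list[dict]:
    """Read one XML file and pivot it into row dicts.
    Reads via the /dbfs FUSE path; element names are placeholders."""
    with open(path) as f:
        root = ET.fromstring(f.read())
    row = {"record_id": root.attrib.get("id")}
    for metric in root.findall(".//metric"):
        row[metric.attrib["name"]] = metric.attrib.get("value")
    return [row]

# Placeholder listing; in reality this is millions of paths from ADLS.
paths = ["/dbfs/mnt/landing/xml/a.xml", "/dbfs/mnt/landing/xml/b.xml"]

# Attempt 1: thread pool over the file list on the driver.
# This helped, but not materially.
with ThreadPoolExecutor(max_workers=32) as pool:
    parsed = [row for rows in pool.map(parse_file, paths) for row in rows]

# Attempt 2: distribute the path list as an RDD and parse on the workers
# (this is the variant that got ugly for me).
rows_rdd = spark.sparkContext.parallelize(paths, numSlices=1024).flatMap(parse_file)
df = spark.createDataFrame(rows_rdd.map(lambda r: Row(**r)))
(df.write.format("delta")
   .mode("append")
   .option("mergeSchema", "true")
   .saveAsTable("landing.xml_pivoted"))
```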