We have implemented a solution for ingesting binary files (.ZIP) into Delta Lake. Our pipeline currently does the following:
- Unzip the file and extract the XML file.
- Parse the XML using Python libraries.
- Flatten the nested XML columns.
- Store the result in a Delta table.
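
For illustration, the unzip, parse, and flatten steps above can be sketched roughly as follows. This is a simplified sketch rather than our exact code: it assumes only the Python standard library (`zipfile`, `xml.etree.ElementTree`), the helper names `flatten` and `rows_from_zip` are hypothetical, and the write to the Delta table is left out.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def flatten(element, prefix=""):
    """Flatten a nested XML element into a dict keyed by underscore-joined paths.

    Hypothetical helper: repeated sibling tags would overwrite each other here,
    so real data with repeating elements needs extra handling (e.g. indexing).
    """
    row = {}
    path = f"{prefix}_{element.tag}" if prefix else element.tag
    # Attributes become columns alongside element text.
    for name, value in element.attrib.items():
        row[f"{path}_{name}"] = value
    children = list(element)
    if not children:
        # Leaf element: its text is the column value.
        row[path] = (element.text or "").strip()
    for child in children:
        row.update(flatten(child, path))
    return row


def rows_from_zip(zip_bytes):
    """Return one flattened dict per XML member found in a ZIP archive (given as bytes)."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".xml"):
                root = ET.fromstring(archive.read(name))
                rows.append(flatten(root))
    return rows
```

Each ZIP is handled one at a time in a pure-Python loop like this, and the resulting list of dicts is then converted to a DataFrame and appended to the Delta table.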
This solution works fine for a small set of files (25). When we process a larger set (650), it takes more time than expected.
We would like to know if there is a better solution to speed up the process.
One thing to note about the XML file: it is a nested XML file with around 600 columns.