XML to Parquet files

reachrishav · 08-08-2024

I have a requirement to ingest large XML files and flatten the data before saving it as Parquet files. I'm reading the files with the spark-xml library, and I have written a Python function that flattens the complex types (array and struct) in the resulting DataFrame. My concern is that ingestion and flattening together take a long time (more than an hour). Is there a more efficient way to do this?
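For reference, here is a minimal sketch of the kind of pipeline described above. It is not my exact code. The names `next_step`, `flatten` and `xml_to_parquet`, the `row_tag` value and the paths are all placeholders I made up for illustration. The planning helper works on the schema in the dictionary form returned by `df.schema.jsonValue()`, so it can be checked without a cluster. It expands all struct columns in one `select` per pass and explodes arrays one at a time.

```python
def next_step(schema_json):
    """Pick the next flattening step from a schema in StructType.jsonValue() form.

    Returns ("select", [(expr, alias), ...]) when any top-level struct exists,
    ("explode", column_name) when only arrays remain, or None when flat.
    """
    fields = schema_json["fields"]
    # Complex types are dicts ({"type": "struct" | "array" | ...}); primitives are strings.
    kinds = {f["name"]: f["type"]["type"] for f in fields if isinstance(f["type"], dict)}

    if "struct" in kinds.values():
        cols = []
        for f in fields:
            name = f["name"]
            if kinds.get(name) == "struct":
                # Promote every child of the struct to a top-level column.
                cols += [(f"`{name}`.`{c['name']}`", f"{name}_{c['name']}")
                         for c in f["type"]["fields"]]
            else:
                cols.append((f"`{name}`", name))
        return ("select", cols)

    for name, kind in kinds.items():
        if kind == "array":
            return ("explode", name)
    return None


def flatten(df):
    """Apply next_step repeatedly until no struct or array columns remain."""
    from pyspark.sql import functions as F  # lazy import; needs pyspark installed

    while (step := next_step(df.schema.jsonValue())) is not None:
        op, arg = step
        if op == "select":
            df = df.select([F.col(expr).alias(alias) for expr, alias in arg])
        else:
            # explode_outer keeps rows whose array is null or empty.
            df = df.withColumn(arg, F.explode_outer(F.col(f"`{arg}`")))
    return df


def xml_to_parquet(spark, in_path, out_path, row_tag):
    """Hypothetical end-to-end wrapper; assumes spark-xml is on the classpath."""
    df = spark.read.format("xml").option("rowTag", row_tag).load(in_path)
    flatten(df).write.mode("overwrite").parquet(out_path)
```

Note that exploding several sibling arrays multiplies the row count, so the flattened output can be much larger than the input. That may account for part of the runtime.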