XML to Parquet files

reachrishav — Fri, 09 Aug 2024 04:01:05 GMT

I have a requirement to ingest large XML files and flatten the data before saving it as Parquet files. I have written a Python function to flatten the complex types (array and struct) in the ingested XML DataFrame, and I'm using the spark-xml library to read the files. My concern is that ingestion and flattening take a long time (more than 1 hour). Is there a way to do this more efficiently?

Re: XML to Parquet files

szymon_dybczak — Fri, 09 Aug 2024 05:46:51 GMT

Hi @reachrishav ,

Since Databricks Runtime 14.3 there is native support for reading and writing XML files. It may be worth checking whether it is faster than the library you're using:

Read and write XML files | Databricks on AWS

You also mentioned that you wrote a Python function to flatten complex types. Do you use it as a UDF? That could also be a performance bottleneck:

What are user-defined functions (UDFs)? | Databricks on AWS

Re: XML to Parquet files

reachrishav — Fri, 09 Aug 2024 06:00:21 GMT

I am still on Databricks Runtime 12.2 LTS. I guess I'm using the same library for reading XML, as the options are similar.
I'm using a custom Python function to flatten the ingested DataFrame. It goes over all the columns of the input DataFrame; if a column's type is complex (struct or array), it flattens it (explode for arrays, the dot (.) operator for structs), and it repeats this until all the columns are simple types.

Something like:
df = spark.read.format('xml').load(path)
flattened_df = flatten_func(df)
flattened_df.write.format('parquet').save(destinationpath)