
XML to Parquet files

reachrishav
New Contributor II

I have a requirement to ingest large XML files and flatten the data before saving it as Parquet files. I have written a Python function to flatten the complex types (array and struct) in the ingested XML DataFrame, and I'm using the spark-xml library to read the files. My concern is that ingestion and flattening take a long time (> 1 hr). Is there a more efficient way to do this?

2 REPLIES

szymon_dybczak
Contributor

Hi @reachrishav ,

Since Databricks Runtime 14.3, there is native support for reading and writing XML files. It may be worth checking whether it is faster than the library you've used:

Read and write XML files | Databricks on AWS

You also mentioned that you wrote a Python function to flatten complex types. Do you use it as a UDF? That could also be a performance bottleneck:

What are user-defined functions (UDFs)? | Databricks on AWS
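
To make the reader suggestion above concrete, here is a minimal sketch of a small helper around the XML reader. The helper name `read_xml`, its parameters, and the example path and row tag are hypothetical. It assumes either the spark-xml library or the native reader is available under the `xml` format name, and that `rowTag` is the option naming the repeating element. Passing an explicit schema is optional; it is included because it lets the reader skip schema inference, which may save a pass over large files.

```python
def read_xml(spark, path, row_tag, schema=None):
    """Read XML into a DataFrame (hypothetical helper, not part of any library).

    spark   : an active SparkSession
    path    : location of the XML file(s)
    row_tag : name of the XML element that maps to one row
    schema  : optional StructType or DDL string; when given, schema
              inference is skipped
    """
    reader = spark.read.format("xml").option("rowTag", row_tag)
    if schema is not None:
        reader = reader.schema(schema)
    return reader.load(path)


# Example call on a cluster (path and row tag are placeholders):
# df = read_xml(spark, "/mnt/raw/books.xml", "book")
```

Timing the same call on both runtimes, with and without a schema, should show whether the read or the flattening dominates the hour.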

I am still on Databricks Runtime 12.2 LTS. I guess I'm using the same library for reading XML, as the options are similar.
I'm using a custom Python function to flatten the ingested DataFrame. It goes over all the columns of the input DataFrame; if a column's type is complex (struct or array), it keeps flattening (explode for arrays, the dot (.) operator for structs) until all the columns are simple types.

Something like:
df = spark.read.format('xml').load(path)
flattened_df = flatten_func(df)
flattened_df.write.format('parquet').save(destinationpath)
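
The body of `flatten_func` is not shown in the thread, so the following is only a sketch of one way such a function could be written without a Python UDF, using built-in column expressions only. The names `struct_leaf_paths`, `array_columns`, and `flatten` are hypothetical. The idea is to expand every nested struct in a single `select` per pass (rather than one struct per loop iteration) and to explode one array column at a time. Flattened names are joined with `_`, which is an assumption and could collide with existing column names.

```python
from typing import Any, Dict, List, Tuple


def struct_leaf_paths(schema_json: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (backticked dotted path, flat name) for each field reachable
    through structs only. Arrays are not descended into; they are returned
    as-is and must be exploded first."""
    out: List[Tuple[str, str]] = []

    def walk(fields: List[Dict[str, Any]], prefix: List[str]) -> None:
        for field in fields:
            path = prefix + [field["name"]]
            dtype = field["type"]
            if isinstance(dtype, dict) and dtype.get("type") == "struct":
                walk(dtype["fields"], path)
            else:
                out.append((".".join(f"`{p}`" for p in path), "_".join(path)))

    walk(schema_json["fields"], [])
    return out


def array_columns(schema_json: Dict[str, Any]) -> List[str]:
    """Names of top-level columns whose type is an array."""
    return [
        f["name"]
        for f in schema_json["fields"]
        if isinstance(f["type"], dict) and f["type"].get("type") == "array"
    ]


def flatten(df):
    """Flatten structs and arrays using built-in functions (no UDF)."""
    # Imported lazily so the schema helpers above work without Spark installed.
    from pyspark.sql import functions as F

    while True:
        # One select expands all nested structs found in the current schema.
        paths = struct_leaf_paths(df.schema.jsonValue())
        df = df.select([F.col(path).alias(name) for path, name in paths])

        arrays = array_columns(df.schema.jsonValue())
        if not arrays:
            return df
        # Explode one array per pass; explode_outer keeps rows whose array
        # is null or empty. Struct elements are expanded on the next pass.
        name = arrays[0]
        df = df.withColumn(name, F.explode_outer(F.col(f"`{name}`")))
```

Note that exploding several independent arrays multiplies the row count (each explode is applied to the result of the previous one), so for wide, deeply nested XML the explodes themselves may account for much of the runtime. In that case it may be worth flattening only the branches that are actually needed, or writing separate Parquet outputs per repeated element.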
