<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: XML to Parquet files in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/xml-to-parquet-files/m-p/82466#M36660</link>
    <description>&lt;P&gt;I am still on Databricks Runtime 12.2 LTS. I guess I'm using the same library for reading XML, as the options are similar.&lt;BR /&gt;I'm using a custom Python function to flatten the ingested DataFrame. It goes over all the columns of the input DataFrame, and if a column type is complex, i.e. struct or array, it keeps flattening it (explode for an array, the dot (.) operator for a struct) until all the columns are simple types.&lt;BR /&gt;&lt;BR /&gt;Something like:&lt;BR /&gt;df = spark.read.format('xml').load(path)&lt;BR /&gt;flattened_df = flatten_func(df)&lt;BR /&gt;flattened_df.write.format('parquet').save(destinationpath)&lt;/P&gt;</description>
    <pubDate>Fri, 09 Aug 2024 06:00:21 GMT</pubDate>
    <dc:creator>reachrishav</dc:creator>
    <dc:date>2024-08-09T06:00:21Z</dc:date>
    <item>
      <title>XML to Parquet files</title>
      <link>https://community.databricks.com/t5/data-engineering/xml-to-parquet-files/m-p/82457#M36657</link>
      <description>&lt;P&gt;I have a requirement to ingest large XML files and flatten the data before saving it as Parquet files. I have created a Python function to flatten the complex types (array &amp;amp; struct) in the ingested XML DataFrame. I'm using the spark-xml library to read the files. My concern is that the ingestion and flattening take a lot of time (&amp;gt; 1 hr). Is there any way to do this more efficiently?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Aug 2024 04:01:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/xml-to-parquet-files/m-p/82457#M36657</guid>
      <dc:creator>reachrishav</dc:creator>
      <dc:date>2024-08-09T04:01:05Z</dc:date>
    </item>
    <item>
      <title>Re: XML to Parquet files</title>
      <link>https://community.databricks.com/t5/data-engineering/xml-to-parquet-files/m-p/82461#M36658</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/115053"&gt;@reachrishav&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;Since Databricks Runtime 14.3 there is native support for reading and writing XML files.&lt;SPAN&gt;&amp;nbsp;Maybe check whether it works faster than the library you've used:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/query/formats/xml.html" target="_blank" rel="noopener"&gt;Read and write XML files | Databricks on AWS&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;You also mentioned that you wrote a Python function to flatten complex types. Do you use it as a UDF? That could also be a performance bottleneck:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/udf/index.html#which-udfs-are-most-efficient" target="_blank"&gt;What are user-defined functions (UDFs)? | Databricks on AWS&lt;/A&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Aug 2024 05:46:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/xml-to-parquet-files/m-p/82461#M36658</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-08-09T05:46:51Z</dc:date>
    </item>
    <item>
      <title>Re: XML to Parquet files</title>
      <link>https://community.databricks.com/t5/data-engineering/xml-to-parquet-files/m-p/82466#M36660</link>
      <description>&lt;P&gt;I am still on Databricks Runtime 12.2 LTS. I guess I'm using the same library for reading XML, as the options are similar.&lt;BR /&gt;I'm using a custom Python function to flatten the ingested DataFrame. It goes over all the columns of the input DataFrame, and if a column type is complex, i.e. struct or array, it keeps flattening it (explode for an array, the dot (.) operator for a struct) until all the columns are simple types.&lt;BR /&gt;&lt;BR /&gt;Something like:&lt;BR /&gt;df = spark.read.format('xml').load(path)&lt;BR /&gt;flattened_df = flatten_func(df)&lt;BR /&gt;flattened_df.write.format('parquet').save(destinationpath)&lt;/P&gt;</description>
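      <!--
Editor's sketch, not the poster's actual flatten_func: the loop described in this post
(explode arrays, expand structs with the dot operator, repeat until only simple types remain)
might look roughly like the following. The names complex_columns and flatten_df are
hypothetical, as are the parent_child naming of expanded columns and the choice of
explode_outer (which keeps rows whose array is null or empty) over explode. It assumes the
generated column names do not collide with existing ones, and it leaves map columns untouched.
It uses only built-in column expressions rather than a Python UDF.

```python
def complex_columns(schema_json):
    """Return (name, type_dict) for each top-level struct or array column.

    `schema_json` is a dict shaped like the output of `df.schema.jsonValue()`:
    simple types appear as strings ("string", "long", ...), complex types as dicts.
    """
    found = []
    for field in schema_json["fields"]:
        dtype = field["type"]
        if isinstance(dtype, dict) and dtype["type"] in ("struct", "array"):
            found.append((field["name"], dtype))
    return found


def flatten_df(df):
    """Flatten struct and array columns one level at a time until none remain."""
    # Imported lazily so that complex_columns can be used without Spark installed.
    from pyspark.sql import functions as F

    pending = complex_columns(df.schema.jsonValue())
    while pending:
        name, dtype = pending[0]
        if dtype["type"] == "struct":
            # Promote each child field to a top-level column named parent_child
            # (assumed naming scheme), then drop the original struct column.
            children = [
                F.col(f"`{name}`.`{child['name']}`").alias(f"{name}_{child['name']}")
                for child in dtype["fields"]
            ]
            df = df.select("*", *children).drop(name)
        else:
            # One output row per array element; the column keeps its name.
            df = df.withColumn(name, F.explode_outer(F.col(f"`{name}`")))
        pending = complex_columns(df.schema.jsonValue())
    return df
```

With a DataFrame read as in the post, this would be called as flattened_df = flatten_df(df).
Note that exploding several independent arrays multiplies the row count, which can itself
account for a large share of the runtime on big files.
      -->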
      <pubDate>Fri, 09 Aug 2024 06:00:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/xml-to-parquet-files/m-p/82466#M36660</guid>
      <dc:creator>reachrishav</dc:creator>
      <dc:date>2024-08-09T06:00:21Z</dc:date>
    </item>
  </channel>
</rss>

