<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to read multiple tiny XML files in parallel in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-read-multiple-tiny-xml-files-in-parallel/m-p/32783#M23916</link>
    <description>&lt;P&gt;Thank you, @Hubert Dudek, for the suggestion. Along the lines of your recommendation, we added a step to our pipeline that merges the small files into larger files and makes them available to the Spark job.&lt;/P&gt;</description>
    <pubDate>Mon, 05 Sep 2022 14:11:38 GMT</pubDate>
    <dc:creator>Paramesh</dc:creator>
    <dc:date>2022-09-05T14:11:38Z</dc:date>
    <item>
      <title>How to read multiple tiny XML files in parallel</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-multiple-tiny-xml-files-in-parallel/m-p/32780#M23913</link>
      <description>&lt;P&gt;Hi team,&lt;/P&gt;&lt;P&gt;We are trying to read many tiny XML files. We can parse them with the Databricks spark-xml JAR, but is there a way to read the files in parallel and distribute the load across the cluster?&lt;/P&gt;&lt;P&gt;Right now the job spends about 90% of its time reading the files; the only transformation is flattening the XML.&lt;/P&gt;&lt;P&gt;Please suggest any way to improve the performance.&lt;/P&gt;&lt;P&gt;Code snippet:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;    def rawXml2df(fnames: List[String], ss: SparkSession): DataFrame = {
      // print(s"fnames ${fnames.mkString(",")}")
      ss.read
        .format("com.databricks.spark.xml")
        .schema(thSchema)
        .option("rowTag", "ns2:TransactionHistory")
        .option("attributePrefix", "_")
        .load(fnames.mkString(","))
    }
&amp;nbsp;
val df0 = rawXml2df(getListOfFiles(new File("ds-tools/aws-glue-local-test/src/main/scala/tracelink/ds/input")), sparkSession)
&amp;nbsp;
Logs: 
&amp;nbsp;
2022-09-01 13:37:36 INFO  - Finished task 14196.0 in stage 2.0 (TID 33078). 2258 bytes result sent to driver
2022-09-01 13:37:36 INFO  - Starting task 14197.0 in stage 2.0 (TID 33079, localhost, executor driver, partition 14197, PROCESS_LOCAL, 8024 bytes)
2022-09-01 13:37:36 INFO  - Finished task 14196.0 in stage 2.0 (TID 33078) in 44 ms on localhost (executor driver) (14197/18881)
2022-09-01 13:37:36 INFO  - Running task 14197.0 in stage 2.0 (TID 33079)
2022-09-01 13:37:36 INFO  - Input split: file:/Users/john/ds-tools/aws-glue-local-test/src/main/scala/ds/input/09426edf-39e0-44d7-bda5-be49ff56512e:0+2684
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 01 Sep 2022 17:45:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-multiple-tiny-xml-files-in-parallel/m-p/32780#M23913</guid>
      <dc:creator>Paramesh</dc:creator>
      <dc:date>2022-09-01T17:45:07Z</dc:date>
    </item>
    <item>
      <title>Re: How to read multiple tiny XML files in parallel</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-multiple-tiny-xml-files-in-parallel/m-p/32781#M23914</link>
      <description>&lt;P&gt;As far as I know, there are no options that would optimize your code. &lt;A href="https://github.com/databricks/spark-xml" target="test_blank"&gt;https://github.com/databricks/spark-xml&lt;/A&gt;&lt;/P&gt;&lt;P&gt;It is the correct and only way to read XML files, so on the Databricks side there is not much you can do except experiment with other cluster configurations.&lt;/P&gt;&lt;P&gt;Reading many small files is always slow; this is a well-known issue commonly called the "tiny files problem."&lt;/P&gt;&lt;P&gt;I don't know your architecture, but perhaps when the XML files are saved they could be appended to an existing file, or some trigger could merge them.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Sep 2022 20:26:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-multiple-tiny-xml-files-in-parallel/m-p/32781#M23914</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-09-01T20:26:36Z</dc:date>
    </item>
    <item>
      <title>Re: How to read multiple tiny XML files in parallel</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-multiple-tiny-xml-files-in-parallel/m-p/32783#M23916</link>
      <description>&lt;P&gt;Thank you, @Hubert Dudek, for the suggestion. Along the lines of your recommendation, we added a step to our pipeline that merges the small files into larger files and makes them available to the Spark job.&lt;/P&gt;</description>
      <pubDate>Mon, 05 Sep 2022 14:11:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-multiple-tiny-xml-files-in-parallel/m-p/32783#M23916</guid>
      <dc:creator>Paramesh</dc:creator>
      <dc:date>2022-09-05T14:11:38Z</dc:date>
    </item>
    <item>
      <title>Re: How to read multiple tiny XML files in parallel</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-multiple-tiny-xml-files-in-parallel/m-p/32784#M23917</link>
      <description>&lt;P&gt;Thank you for the follow-up. I have added my new comment.&lt;/P&gt;</description>
      <pubDate>Mon, 05 Sep 2022 14:11:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-multiple-tiny-xml-files-in-parallel/m-p/32784#M23917</guid>
      <dc:creator>Paramesh</dc:creator>
      <dc:date>2022-09-05T14:11:59Z</dc:date>
    </item>
  </channel>
</rss>

