<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Auto Loader and source file structure optimisation in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/auto-loader-and-source-file-structure-optimisation/m-p/51416#M29147</link>
    <description>&lt;P&gt;Hi.&amp;nbsp; I have a question, and I've not been able to find an answer.&amp;nbsp; I'm sure there is one... I just haven't found it through searching and browsing the docs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;How much does it matter&amp;nbsp;&lt;/STRONG&gt;(if it is indeed that simple)&amp;nbsp;&lt;STRONG&gt;whether source files read by Auto Loader are in a single folder or structured by subfolders&amp;nbsp;&lt;/STRONG&gt;(e.g. YYYY \ MM \ DD)?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My environment is Azure Databricks and ADLS Gen2 (using hierarchical namespace).&amp;nbsp; In this case, I have 4 "folders", each of which contains all the files we've ever received from various post API methods (1 folder for each method).&amp;nbsp; It was not set up to create subfolders based on date.&amp;nbsp; So each folder currently holds anywhere from &amp;lt;1 million to &amp;gt;5 million files, depending on the method.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I need to migrate this data, and what I'm really asking is: is it worth the effort of copying it to a date-based structure because that will make the Auto Loader part more efficient, or should I just dump it over as-is and carry on with life?&lt;/P&gt;</description>
    <pubDate>Tue, 14 Nov 2023 02:13:34 GMT</pubDate>
    <dc:creator>ilarsen</dc:creator>
    <dc:date>2023-11-14T02:13:34Z</dc:date>
    <item>
      <title>Auto Loader and source file structure optimisation</title>
      <link>https://community.databricks.com/t5/data-engineering/auto-loader-and-source-file-structure-optimisation/m-p/51416#M29147</link>
      <description>&lt;P&gt;Hi.&amp;nbsp; I have a question, and I've not been able to find an answer.&amp;nbsp; I'm sure there is one... I just haven't found it through searching and browsing the docs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;How much does it matter&amp;nbsp;&lt;/STRONG&gt;(if it is indeed that simple)&amp;nbsp;&lt;STRONG&gt;whether source files read by Auto Loader are in a single folder or structured by subfolders&amp;nbsp;&lt;/STRONG&gt;(e.g. YYYY \ MM \ DD)?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My environment is Azure Databricks and ADLS Gen2 (using hierarchical namespace).&amp;nbsp; In this case, I have 4 "folders", each of which contains all the files we've ever received from various post API methods (1 folder for each method).&amp;nbsp; It was not set up to create subfolders based on date.&amp;nbsp; So each folder currently holds anywhere from &amp;lt;1 million to &amp;gt;5 million files, depending on the method.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I need to migrate this data, and what I'm really asking is: is it worth the effort of copying it to a date-based structure because that will make the Auto Loader part more efficient, or should I just dump it over as-is and carry on with life?&lt;/P&gt;</description>
      <pubDate>Tue, 14 Nov 2023 02:13:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/auto-loader-and-source-file-structure-optimisation/m-p/51416#M29147</guid>
      <dc:creator>ilarsen</dc:creator>
      <dc:date>2023-11-14T02:13:34Z</dc:date>
    </item>
    <item>
      <title>Re: Auto Loader and source file structure optimisation</title>
      <link>https://community.databricks.com/t5/data-engineering/auto-loader-and-source-file-structure-optimisation/m-p/51876#M29315</link>
      <description>&lt;P&gt;Thanks for your response; that does help.&amp;nbsp; From what I found - or didn't find, rather - it didn't seem to me that it would have a huge performance impact, either.&amp;nbsp; A full-scale test would perhaps be the only way for me to learn for sure, but that may not be worth the effort.&amp;nbsp; The flat file structure is historical now; a new process lands these files in a subfolder structure.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;That said, I am still interested in hearing from anyone who comes across this and can shed more light on the potential performance impacts of flat-vs-hierarchical source file folder structures with Auto Loader ingestion.&lt;/P&gt;</description>
      <pubDate>Tue, 14 Nov 2023 22:47:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/auto-loader-and-source-file-structure-optimisation/m-p/51876#M29315</guid>
      <dc:creator>ilarsen</dc:creator>
      <dc:date>2023-11-14T22:47:02Z</dc:date>
    </item>
  </channel>
</rss>

