<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Dynamically supplying partitions to autoloader in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33015#M24106</link>
    <description>&lt;P&gt;We have a streaming use case, and we see a lot of time spent in directory listing from Azure.&lt;/P&gt;&lt;P&gt;Is it possible to supply partitions to Auto Loader dynamically, on the fly?&lt;/P&gt;</description>
    <pubDate>Thu, 16 Dec 2021 14:39:39 GMT</pubDate>
    <dc:creator>Soma</dc:creator>
    <dc:date>2021-12-16T14:39:39Z</dc:date>
    <item>
      <title>Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33015#M24106</link>
      <description>&lt;P&gt;We have a streaming use case, and we see a lot of time spent in directory listing from Azure.&lt;/P&gt;&lt;P&gt;Is it possible to supply partitions to Auto Loader dynamically, on the fly?&lt;/P&gt;</description>
      <pubDate>Thu, 16 Dec 2021 14:39:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33015#M24106</guid>
      <dc:creator>Soma</dc:creator>
      <dc:date>2021-12-16T14:39:39Z</dc:date>
    </item>
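A hedged sketch of the "dynamic partitions" idea from the question: compute the partition paths for a recent UTC window in plain Python and hand one to Auto Loader per run. The yyyy/MM/dd/HH directory layout and the abfss base URI are illustrative assumptions, not something stated in the thread.

```python
from datetime import datetime, timedelta, timezone

def recent_partition_paths(base, hours, now=None):
    """Return the yyyy/MM/dd/HH partition paths covering the last `hours`
    hours in UTC. The directory layout is an assumption for illustration."""
    now = now or datetime.now(timezone.utc)
    paths = {f"{base}/{now - timedelta(hours=h):%Y/%m/%d/%H}" for h in range(hours + 1)}
    return sorted(paths)

# Each scheduled Trigger.Once-style run could then point the stream at only
# the recent window instead of the whole input directory, e.g.:
#   spark.readStream.format("cloudFiles")...load(path) for each recent path
paths = recent_partition_paths("abfss://raw@acct.dfs.core.windows.net/events", 2,
                               now=datetime(2021, 12, 16, 14, 39, tzinfo=timezone.utc))
```

Note that restarting a stream with a different source path per run trades checkpoint continuity for cheaper listing, which is why the replies below steer toward file notifications instead.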
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33017#M24108</link>
      <description>&lt;P&gt;I know the pain of listing operations on the Azure bill &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;. In my case I solved it with a lower trigger frequency, but a good option can be &lt;B&gt;file notification mode&lt;/B&gt;. Additionally, you can set up your own queue and Event Grid to have more control over it (although first experiments can be done with the automatically created ones):&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;&lt;I&gt;File notification&lt;/I&gt;&lt;/B&gt;&lt;I&gt;: Uses Azure Event Grid and Queue Storage services that subscribe to file events from the input directory. Auto Loader automatically sets up the Azure Event Grid and Queue Storage services. File notification mode is more performant and scalable for large input directories. To use this mode, you must configure &lt;/I&gt;&lt;A href="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader-gen2#permissions" alt="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader-gen2#permissions" target="_blank"&gt;&lt;I&gt;permissions&lt;/I&gt;&lt;/A&gt;&lt;I&gt; for the Azure Event Grid and Queue Storage services and specify&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;.option("cloudFiles.useNotifications","true")&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;. File notifications are supported for ADLS Gen2 and Azure Blob Storage.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;source: &lt;A href="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader-gen2" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader-gen2&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 16 Dec 2021 15:23:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33017#M24108</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-12-16T15:23:29Z</dc:date>
    </item>
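A minimal configuration sketch for the file notification mode described in the reply above. The `cloudFiles.useNotifications` option name comes from the quoted docs; the `json` format and the abfss path in the comment are placeholders, and the actual stream wiring (which needs a Databricks cluster) is shown only as a comment so the snippet stays self-contained.

```python
# Options for Auto Loader file notification mode (Azure Event Grid + Queue
# Storage) instead of directory listing; "json" input format is an assumption.
notification_opts = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",
}

# On Databricks this would be applied roughly as:
# df = (spark.readStream.format("cloudFiles")
#         .options(**notification_opts)
#         .load("abfss://<container>@<account>.dfs.core.windows.net/input"))
```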
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33018#M24109</link>
      <description>&lt;P&gt;Hi, yes, it is taking a long time, and we are planning to use trigger once at a high frequency. We will also check the Event Grid approach, but I am curious why Spark can't have an option to take only the last 1 or 2 hours, for example, based on a UTC timestamp; that would save Spark a lot of time, and configuring Event Grid with a custom trigger needs considerable time and effort.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Dec 2021 15:29:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33018#M24109</guid>
      <dc:creator>Soma</dc:creator>
      <dc:date>2021-12-16T15:29:18Z</dc:date>
    </item>
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33019#M24110</link>
      <description>&lt;P&gt;Hi @somanath Sankaran&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I recommend using Trigger.AvailableNow instead of Trigger.Once. Here is the link to the docs: &lt;A href="https://docs.databricks.com/release-notes/runtime/10.1.html#triggeravailablenow-for-auto-loader" target="_blank"&gt;https://docs.databricks.com/release-notes/runtime/10.1.html#triggeravailablenow-for-auto-loader&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Going back to your original question, you can use incremental listing. Date partitions can be considered lexically ordered if data is processed once a day, and file paths containing timestamps can be considered lexically ordered as well.&lt;/P&gt;&lt;P&gt;Docs here: &lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html#incremental-listing" target="_blank"&gt;https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html#incremental-listing&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 26 Jan 2022 00:11:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33019#M24110</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-01-26T00:11:42Z</dc:date>
    </item>
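A sketch combining the two suggestions above, Trigger.AvailableNow plus incremental listing. The `cloudFiles.useIncrementalListing` option name comes from the linked docs; the `json` format and the paths in the comment are placeholders, and the Databricks-only stream wiring is kept as a comment.

```python
# Incremental listing relies on lexically ordered file paths (e.g. date or
# timestamp directories); "auto" lets Auto Loader detect whether that holds.
incremental_opts = {
    "cloudFiles.format": "json",                 # assumption: JSON input
    "cloudFiles.useIncrementalListing": "auto",  # "auto" | "true" | "false"
}

# Applied on Databricks roughly as:
# (spark.readStream.format("cloudFiles").options(**incremental_opts)
#    .load("abfss://<container>@<account>.dfs.core.windows.net/input")
#    .writeStream.trigger(availableNow=True)     # replaces Trigger.Once
#    .option("checkpointLocation", "/mnt/checkpoints/input")
#    .start("/mnt/delta/output"))
```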
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33020#M24111</link>
      <description>&lt;P&gt;Hi @jose, despite using incremental listing I still see around 3 to 4 minutes consumed in listing. We have now solved it with an Event Grid-based approach (initially we tried Auto Loader and it was not detecting events without flush-with-close; we fixed the issue on the source side by setting the close parameter to true in the ADLS Gen2 SDK).&lt;/P&gt;</description>
      <pubDate>Wed, 26 Jan 2022 01:38:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33020#M24111</guid>
      <dc:creator>Soma</dc:creator>
      <dc:date>2022-01-26T01:38:21Z</dc:date>
    </item>
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33021#M24112</link>
      <description>&lt;P&gt;@somanath Sankaran&amp;nbsp;- Thank you for posting your solution. Would you be happy to mark your answer as best so that other members may find it more quickly?&lt;/P&gt;</description>
      <pubDate>Wed, 26 Jan 2022 16:02:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33021#M24112</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-26T16:02:13Z</dc:date>
    </item>
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/88774#M37611</link>
      <description>&lt;P&gt;Hey, I am curious how you monitor the listing costs. In my case Auto Loader will be listing a folder based on the table name, and inside each table-name folder there will be one yyyyMMdd folder per day, so roughly 365 folders per year. My checkpoint looks inside each table-name folder, and each day folder will include maybe 100 files. Do you think it is better to supply the day folder as the source path to reduce costs?&lt;/P&gt;</description>
      <pubDate>Thu, 05 Sep 2024 17:18:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/88774#M37611</guid>
      <dc:creator>Changedata</dc:creator>
      <dc:date>2024-09-05T17:18:13Z</dc:date>
    </item>
  </channel>
</rss>

