<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Notifications have file information but dataframe is empty using autoloader file notification mode in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/notifications-have-file-information-but-dataframe-is-empty-using/m-p/106110#M42386</link>
    <description>&lt;P&gt;Using DBR 13.3, I'm ingesting data from one ADLS storage account using Auto Loader with file notification mode enabled, and writing to a container in another ADLS storage account. This is older code that uses a &lt;STRONG&gt;foreachBatch&lt;/STRONG&gt; sink to process the data before merging with Delta Lake tables.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Issue&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Notifications are generated for new files, and when the streaming job runs, an epoch_id is generated for each batch processed in&amp;nbsp;&lt;SPAN&gt;foreachBatch(), but the dataframe for that epoch_id is empty.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The file referenced in the notification does contain data, so it's not that the source data is empty.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Moreover, if I switch to directory listing mode (the default), it works.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The following file-notification-specific options are set:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;options["cloudFiles.includeExistingFiles"]  = "false"
options["cloudFiles.subscriptionId"]        = cloudfiles_subscriptionid
options["cloudFiles.tenantId"]              = cloudfiles_tenantid
options["cloudFiles.clientId"]              = cloudfiles_clientid
options["cloudFiles.clientSecret"]          = cloudfiles_clientsecret
options["cloudFiles.resourceGroup"]         = cloudfiles_resourcegroup
options["cloudFiles.fetchParallelism"]      = "5"
options["cloudFiles.resourceTag.streaming_job_autoloader_file_notification_enabled"]  = "true"
options["cloudFiles.resourceTag.streaming_job_autoloader_stream_id"]  = "some_id"
options["cloudFiles.queueName"]             = "some_pregenerated_queue"&lt;/LI-CODE&gt;</description>
    <pubDate>Fri, 17 Jan 2025 15:28:23 GMT</pubDate>
    <dc:creator>Abdul-Mannan</dc:creator>
    <dc:date>2025-01-17T15:28:23Z</dc:date>
    <item>
      <title>Notifications have file information but dataframe is empty using autoloader file notification mode</title>
      <link>https://community.databricks.com/t5/data-engineering/notifications-have-file-information-but-dataframe-is-empty-using/m-p/106110#M42386</link>
      <description>&lt;P&gt;Using DBR 13.3, I'm ingesting data from one ADLS storage account using Auto Loader with file notification mode enabled, and writing to a container in another ADLS storage account. This is older code that uses a &lt;STRONG&gt;foreachBatch&lt;/STRONG&gt; sink to process the data before merging with Delta Lake tables.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Issue&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Notifications are generated for new files, and when the streaming job runs, an epoch_id is generated for each batch processed in&amp;nbsp;&lt;SPAN&gt;foreachBatch(), but the dataframe for that epoch_id is empty.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The file referenced in the notification does contain data, so it's not that the source data is empty.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Moreover, if I switch to directory listing mode (the default), it works.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The following file-notification-specific options are set:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;options["cloudFiles.includeExistingFiles"]  = "false"
options["cloudFiles.subscriptionId"]        = cloudfiles_subscriptionid
options["cloudFiles.tenantId"]              = cloudfiles_tenantid
options["cloudFiles.clientId"]              = cloudfiles_clientid
options["cloudFiles.clientSecret"]          = cloudfiles_clientsecret
options["cloudFiles.resourceGroup"]         = cloudfiles_resourcegroup
options["cloudFiles.fetchParallelism"]      = "5"
options["cloudFiles.resourceTag.streaming_job_autoloader_file_notification_enabled"]  = "true"
options["cloudFiles.resourceTag.streaming_job_autoloader_stream_id"]  = "some_id"
options["cloudFiles.queueName"]             = "some_pregenerated_queue"&lt;/LI-CODE&gt;</description>
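The options above would typically be wired into an Auto Loader stream roughly as follows. This is a minimal sketch, not the poster's actual job: build_options, source_path, checkpoint_path, and merge_into_delta are placeholder names. Note that the listed options do not include cloudFiles.useNotifications, which must be set to "true" for Auto Loader to run in file notification mode at all; it is worth confirming it is set somewhere in the real code.

```python
def build_options(subscription_id, tenant_id, client_id,
                  client_secret, resource_group, queue_name):
    """Assemble cloudFiles options for file notification mode (all string-valued)."""
    return {
        # Required to switch Auto Loader from directory listing
        # to file notification mode.
        "cloudFiles.useNotifications": "true",
        "cloudFiles.includeExistingFiles": "false",
        "cloudFiles.subscriptionId": subscription_id,
        "cloudFiles.tenantId": tenant_id,
        "cloudFiles.clientId": client_id,
        "cloudFiles.clientSecret": client_secret,
        "cloudFiles.resourceGroup": resource_group,
        "cloudFiles.fetchParallelism": "5",
        # Pre-created queue, as in the original post.
        "cloudFiles.queueName": queue_name,
    }

# Streaming wiring (requires a Databricks/Spark session; shown for shape only):
# df = (spark.readStream.format("cloudFiles")
#       .option("cloudFiles.format", "json")          # format is an assumption
#       .options(**build_options(...))
#       .load(source_path))
# (df.writeStream
#    .foreachBatch(merge_into_delta)  # merge_into_delta(batch_df, epoch_id)
#    .option("checkpointLocation", checkpoint_path)
#    .start())
```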
      <pubDate>Fri, 17 Jan 2025 15:28:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/notifications-have-file-information-but-dataframe-is-empty-using/m-p/106110#M42386</guid>
      <dc:creator>Abdul-Mannan</dc:creator>
      <dc:date>2025-01-17T15:28:23Z</dc:date>
    </item>
    <item>
      <title>Re: Notifications have file information but dataframe is empty using autoloader file notification mode</title>
      <link>https://community.databricks.com/t5/data-engineering/notifications-have-file-information-but-dataframe-is-empty-using/m-p/106129#M42395</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;Here are some potential steps and considerations to troubleshoot and resolve the issue:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Permissions and Configuration&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;&lt;SPAN&gt;Ensure that the necessary permissions are correctly set up for file notification mode. This includes having the appropriate roles and permissions for the Azure Event Grid and Azure Queue Storage. The required roles include:&lt;/SPAN&gt;
&lt;UL class="_1t7bu9h8 _1t7bu9h2"&gt;
&lt;LI&gt;&lt;SPAN&gt;Contributor: For setting up resources in your storage account.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Storage Queue Data Contributor: For performing queue operations.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;EventGrid EventSubscription Contributor: For performing Event Grid subscription operations.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;&lt;STRONG&gt;Event Grid Registration&lt;/STRONG&gt;:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;&lt;SPAN&gt;Verify that Event Grid is registered as a Resource Provider in your Azure subscription. If not, you can register it through the Azure portal under the Resource Providers section.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;&lt;STRONG&gt;File Notification Events&lt;/STRONG&gt;:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;&lt;SPAN&gt;Check if the file notification events are being correctly generated and processed. For ADLS Gen2, Auto Loader listens for the &lt;CODE&gt;FlushWithClose&lt;/CODE&gt; event for processing a file. Ensure that this event is being triggered correctly.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;CloudFiles Options&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;Review the &lt;CODE&gt;cloudFiles&lt;/CODE&gt; options you have set. Ensure that all necessary options are correctly configured, including &lt;CODE&gt;cloudFiles.subscriptionId&lt;/CODE&gt;, &lt;CODE&gt;cloudFiles.tenantId&lt;/CODE&gt;, &lt;CODE&gt;cloudFiles.clientId&lt;/CODE&gt;, &lt;CODE&gt;cloudFiles.clientSecret&lt;/CODE&gt;, &lt;CODE&gt;cloudFiles.resourceGroup&lt;/CODE&gt;, and &lt;CODE&gt;cloudFiles.queueName&lt;/CODE&gt;.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Backfill Interval&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;&lt;SPAN&gt;Consider setting the &lt;CODE&gt;cloudFiles.backfillInterval&lt;/CODE&gt; option to trigger regular backfills. This helps ensure that all files are discovered within a given SLA when data completeness is a requirement.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;</description>
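Step 5 above can be sketched against the options dict from the original post. This is an illustrative fragment: the interval value is an example, not a recommendation, and the dict is abbreviated to the queue-related keys.

```python
# Abbreviated options dict, mirroring the shape used in the question.
options = {
    "cloudFiles.useNotifications": "true",
    "cloudFiles.queueName": "some_pregenerated_queue",
}

# Trigger an asynchronous directory-listing backfill at a fixed interval,
# so any file whose notification never reached the queue is still ingested.
options["cloudFiles.backfillInterval"] = "1 day"
```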
      <pubDate>Fri, 17 Jan 2025 18:46:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/notifications-have-file-information-but-dataframe-is-empty-using/m-p/106129#M42395</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-01-17T18:46:15Z</dc:date>
    </item>
  </channel>
</rss>

