<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Get the list of loaded files from Autoloader in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/get-the-list-of-loaded-files-from-autoloader/m-p/33765#M24703</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;We can use &lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-s3.html?_ga=2.209987093.1207564024.1638788585-711712451.1635730674" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-s3.html?_ga=2.209987093.1207564024.1638788585-711712451.1635730674" target="_blank"&gt;Autoloader&lt;/A&gt; to track which files have been loaded from an S3 bucket. My question about Autoloader: &lt;B&gt;is there a way to read the Autoloader database to get the list of files that have been loaded?&lt;/B&gt;&lt;/P&gt;&lt;P&gt;I can easily do this with an AWS Glue job bookmark, but I'm not aware of how to do this in Databricks Autoloader.&lt;/P&gt;</description>
    <pubDate>Mon, 06 Dec 2021 11:12:04 GMT</pubDate>
    <dc:creator>herry</dc:creator>
    <dc:date>2021-12-06T11:12:04Z</dc:date>
    <item>
      <title>Get the list of loaded files from Autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/get-the-list-of-loaded-files-from-autoloader/m-p/33765#M24703</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;We can use &lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-s3.html?_ga=2.209987093.1207564024.1638788585-711712451.1635730674" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-s3.html?_ga=2.209987093.1207564024.1638788585-711712451.1635730674" target="_blank"&gt;Autoloader&lt;/A&gt; to track which files have been loaded from an S3 bucket. My question about Autoloader: &lt;B&gt;is there a way to read the Autoloader database to get the list of files that have been loaded?&lt;/B&gt;&lt;/P&gt;&lt;P&gt;I can easily do this with an AWS Glue job bookmark, but I'm not aware of how to do this in Databricks Autoloader.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Dec 2021 11:12:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/get-the-list-of-loaded-files-from-autoloader/m-p/33765#M24703</guid>
      <dc:creator>herry</dc:creator>
      <dc:date>2021-12-06T11:12:04Z</dc:date>
    </item>
    <item>
      <title>Re: Get the list of loaded files from Autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/get-the-list-of-loaded-files-from-autoloader/m-p/33766#M24704</link>
      <description>&lt;PRE&gt;&lt;CODE&gt;# requires: from pyspark.sql.functions import input_file_name
  .load("path")
  .withColumn("filePath", input_file_name())&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Then you can, for example, write filePath to your stream sink and select the distinct values from there, or use foreach / foreachBatch to insert it into, for example, a Spark SQL table.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Dec 2021 11:25:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/get-the-list-of-loaded-files-from-autoloader/m-p/33766#M24704</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-12-06T11:25:36Z</dc:date>
    </item>
    <item>
      <title>Re: Get the list of loaded files from Autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/get-the-list-of-loaded-files-from-autoloader/m-p/33767#M24705</link>
      <description>&lt;P&gt;Thank you! This works for me &lt;span class="lia-unicode-emoji" title=":folded_hands:"&gt;🙏&lt;/span&gt; &lt;/P&gt;</description>
      <pubDate>Thu, 09 Dec 2021 15:55:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/get-the-list-of-loaded-files-from-autoloader/m-p/33767#M24705</guid>
      <dc:creator>herry</dc:creator>
      <dc:date>2021-12-09T15:55:31Z</dc:date>
    </item>
    <item>
      <title>Re: Get the list of loaded files from Autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/get-the-list-of-loaded-files-from-autoloader/m-p/33768#M24706</link>
      <description>&lt;P&gt;@Herry Ramli - Would you be happy to mark Hubert's answer as best so that other members can find the solution more easily?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 09 Dec 2021 16:12:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/get-the-list-of-loaded-files-from-autoloader/m-p/33768#M24706</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-12-09T16:12:42Z</dc:date>
    </item>
    <item>
      <title>Re: Get the list of loaded files from Autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/get-the-list-of-loaded-files-from-autoloader/m-p/81556#M36343</link>
      <description>&lt;P&gt;A more efficient way is to query the Auto Loader checkpoint state with cloud_files_state:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT * FROM cloud_files_state('path/to/checkpoint');&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 01 Aug 2024 21:03:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/get-the-list-of-loaded-files-from-autoloader/m-p/81556#M36343</guid>
      <dc:creator>kumar_ravi</dc:creator>
      <dc:date>2024-08-01T21:03:12Z</dc:date>
    </item>
  </channel>
</rss>

