<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: how does autoloader handle source outage in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-does-autoloader-handle-source-outage/m-p/88779#M37614</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/119002"&gt;@sakuraDev&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;1.&amp;nbsp;&lt;SPAN&gt;The &lt;/SPAN&gt;&lt;SPAN class=""&gt;availableNow&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;trigger processes all data that is available when the query starts and then stops the query. As you noticed, your data was processed once, and you now need to start the query again to process new files.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;2. Changing the trigger to&amp;nbsp;&lt;SPAN class=""&gt;.trigger(processingTime="1 second")&lt;/SPAN&gt;&amp;nbsp;means that the streaming query will check for new files every second. If no new files arrive because of a source outage, the query will not terminate; it will keep checking for new files at the specified interval.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Important consideration&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;The cluster will run continuously, which costs considerably more than running the process with the availableNow trigger.&lt;/P&gt;&lt;P&gt;3. Don't worry about the outage. Auto Loader is built around checkpointing: if you stop or detach the Auto Loader job and then restart it, the job resumes processing from where it left off. The checkpointLocation option you've specified lets Auto Loader keep track of which files have already been processed. When the job is restarted, it processes any new files that arrived during the outage, so no data is missed.&lt;/P&gt;</description>
    <pubDate>Thu, 05 Sep 2024 18:22:23 GMT</pubDate>
    <dc:creator>filipniziol</dc:creator>
    <dc:date>2024-09-05T18:22:23Z</dc:date>
    <item>
      <title>how does autoloader handle source outage</title>
      <link>https://community.databricks.com/t5/data-engineering/how-does-autoloader-handle-source-outage/m-p/88598#M37557</link>
      <description>&lt;P&gt;Hey guys,&lt;/P&gt;&lt;P&gt;I've been looking for docs on how Auto Loader handles a source outage. I am currently running the following code:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import col, current_timestamp

# Auto Loader stream: read JSON files from the S3 source and append them to the bronze Delta target
dfBronze = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(json_schema_bronze)
    .load("myS3Source")
    .withColumn("file_path", col("_metadata.file_path"))
    .withColumn("ingestion_time", current_timestamp())
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_dir_path_bronze)
    .outputMode("append")
    .trigger(availableNow=True)  # I want to change this to .trigger(processingTime="1 second")
    .start(bronze_table)
)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My question is: if I run this code, will it attach to the cluster and wait indefinitely for file arrivals, even if the streaming source has an outage?&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="sakuraDev_0-1725478024362.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10931iB42BC13595A304CA/image-size/medium?v=v2&amp;amp;px=400" role="button" title="sakuraDev_0-1725478024362.png" alt="sakuraDev_0-1725478024362.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Does the last screenshot mean that it will not run again unless I trigger it?&lt;/P&gt;&lt;P&gt;If I stop/detach the Auto Loader, will it sync all the files that arrived during the "autoloader outage" once it is run again?&lt;BR /&gt;&lt;BR /&gt;I know the last question is technically answered, but I just want to make sure I'm understanding correctly.&lt;/P&gt;&lt;P&gt;Thanks for the help&lt;/P&gt;</description>
      <pubDate>Wed, 04 Sep 2024 19:28:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-does-autoloader-handle-source-outage/m-p/88598#M37557</guid>
      <dc:creator>sakuraDev</dc:creator>
      <dc:date>2024-09-04T19:28:20Z</dc:date>
    </item>
    <item>
      <title>Re: how does autoloader handle source outage</title>
      <link>https://community.databricks.com/t5/data-engineering/how-does-autoloader-handle-source-outage/m-p/88779#M37614</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/119002"&gt;@sakuraDev&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;1.&amp;nbsp;&lt;SPAN&gt;The &lt;/SPAN&gt;&lt;SPAN class=""&gt;availableNow&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;trigger processes all data that is available when the query starts and then stops the query. As you noticed, your data was processed once, and you now need to start the query again to process new files.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;2. Changing the trigger to&amp;nbsp;&lt;SPAN class=""&gt;.trigger(processingTime="1 second")&lt;/SPAN&gt;&amp;nbsp;means that the streaming query will check for new files every second. If no new files arrive because of a source outage, the query will not terminate; it will keep checking for new files at the specified interval.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Important consideration&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;The cluster will run continuously, which costs considerably more than running the process with the availableNow trigger.&lt;/P&gt;&lt;P&gt;3. Don't worry about the outage. Auto Loader is built around checkpointing: if you stop or detach the Auto Loader job and then restart it, the job resumes processing from where it left off. The checkpointLocation option you've specified lets Auto Loader keep track of which files have already been processed. When the job is restarted, it processes any new files that arrived during the outage, so no data is missed.&lt;/P&gt;</description>
      <pubDate>Thu, 05 Sep 2024 18:22:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-does-autoloader-handle-source-outage/m-p/88779#M37614</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-09-05T18:22:23Z</dc:date>
    </item>
  </channel>
</rss>

