How does Auto Loader handle a source outage?

sakuraDev · 09-04-2024

Hey guys,

I've been looking for docs on how Auto Loader handles a source outage. I am currently running the following code:

from pyspark.sql.functions import col, current_timestamp

# Read JSON files from S3 with Auto Loader and append them to the bronze Delta table.
dfBronze = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(json_schema_bronze)
    .load("myS3Source")
    .withColumn("file_path", col("_metadata.file_path"))
    .withColumn("ingestion_time", current_timestamp())
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_dir_path_bronze)
    .outputMode("append")
    # I want to change this to .trigger(processingTime="1 second")
    .trigger(availableNow=True)
    .start(bronze_table)
)

My question: if I run this code, will it attach to the cluster and wait permanently for file arrivals, even if the streaming source has an outage?

Does the last screenshot mean that it will not run again unless I trigger it?

If I stop or detach Auto Loader, will it sync all the files that arrived during the "Auto Loader outage" once it is run again?

I know the last question is technically answered, but I just want to make sure I'm understanding correctly.

Thanks for the help.