Hey guys,
I've been looking for docs on how Auto Loader handles a source outage. I'm currently running the following code:
from pyspark.sql.functions import col, current_timestamp

dfBronze = (  # note: .start() returns a StreamingQuery, not a DataFrame
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(json_schema_bronze)
    .load("myS3Source")
    .withColumn("file_path", col("_metadata.file_path"))
    .withColumn("ingestion_time", current_timestamp())
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_dir_path_bronze)
    .outputMode("append")
    # I want to change this to .trigger(processingTime="1 second")
    .trigger(availableNow=True)
    .start(bronze_table)
)
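For context, the variant I want to move to would look like this. It's the same pipeline with only the trigger swapped (processingTime is the standard Structured Streaming trigger, nothing Auto Loader specific; dfBronzeContinuous is just an illustrative name):

# Planned variant: instead of draining the backlog and stopping, keep the
# query alive and check for newly arrived files every second.
dfBronzeContinuous = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(json_schema_bronze)
    .load("myS3Source")
    .withColumn("file_path", col("_metadata.file_path"))
    .withColumn("ingestion_time", current_timestamp())
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_dir_path_bronze)
    .outputMode("append")
    .trigger(processingTime="1 second")  # micro-batch every second while the query runs
    .start(bronze_table)
)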
My questions: if I run this code, will it stay attached to the cluster and wait indefinitely for new file arrivals, even if the source stream has an outage?
Does the last screenshot mean that it will not run again unless I trigger it manually?
If I stop/detach the Auto Loader stream, will it, once it runs again, pick up all of the files that arrived during the "Auto Loader outage"?
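Just to make my assumption on that last point concrete: after an outage I'd expect to be able to simply re-run the same availableNow query against the same checkpoint and have it backfill, something like this (catchup_query is an illustrative name; source, schema, and checkpoint are the same as above):

from pyspark.sql.functions import col, current_timestamp

# Assumption: because checkpointLocation is unchanged, Auto Loader should skip
# files already recorded in the checkpoint and ingest only the files that
# arrived while the stream was down.
catchup_query = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(json_schema_bronze)
    .load("myS3Source")
    .withColumn("file_path", col("_metadata.file_path"))
    .withColumn("ingestion_time", current_timestamp())
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_dir_path_bronze)  # same checkpoint as before
    .outputMode("append")
    .trigger(availableNow=True)  # drain the backlog, then stop on its own
    .start(bronze_table)
)
catchup_query.awaitTermination()  # block until the backlog is fully processed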
I know the last question is technically answered already, but I just want to make sure I'm understanding correctly.
Thanks for the help!