Re: Spark Streaming - only process new files in st...

Michael_Galli · ‎05-06-2022

It seems that the "maxFileAge" option solves the problem.

streaming_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("maxFilesPerTrigger", 20)
    .option("multiLine", True)
    .option("maxFileAge", 1)
    .schema(schema)
    .load(streaming_path)
)

This ignores files older than 1 week.

But how can I ignore files older than 1 day?
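
One direction that might be worth trying (untested here, so treat it as a sketch): pass the age as a string with an explicit unit instead of the bare number 1. As far as I know, Spark's built-in file source reads "maxFileAge" as a duration string with a default of one week, and Auto Loader documents a prefixed "cloudFiles.maxFileAge" option that takes an interval string. Whether the un-prefixed key is honored by "cloudFiles" at all is an assumption I have not confirmed, so the key is a parameter below. The helper names `age_filter_options` and `build_stream` are made up for illustration.

```python
def age_filter_options(max_file_age="1 day", option_key="cloudFiles.maxFileAge"):
    """Return reader options with the age limit as a string that carries its unit.

    Assumption: the option accepts an interval string such as "1 day".
    Pass option_key="maxFileAge" to try the un-prefixed key instead.
    """
    return {
        "cloudFiles.format": "json",
        "maxFilesPerTrigger": 20,
        "multiLine": True,
        option_key: max_file_age,
    }


def build_stream(spark, schema, path, options):
    """Apply each option to an Auto Loader reader, then load the path."""
    reader = spark.readStream.format("cloudFiles")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.schema(schema).load(path)


# Usage in a notebook where spark, schema and streaming_path already exist:
# streaming_df = build_stream(
#     spark, schema, streaming_path, age_filter_options("1 day")
# )
```

Keeping the options in a dict makes it easy to switch between the two keys and compare which files each run actually picks up.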