Spark Streaming - only process new files in streaming path?

Michael_Galli · 05-06-2022

In our streaming jobs, we currently run a stream (cloudFiles format) on a directory where sales transactions arrive every 5 minutes.

In this directory, the transactions are organized in the following layout:

<streaming-checkpoint-root>/<transaction_date>/<transaction_hour>/transaction_x_y.json

Only TODAY's transactions are of interest; all others are already obsolete.

When I start the streaming job, it processes all the historical transactions, which I don't want.

Is it somehow possible to process only NEW files coming in after the streaming job has started?
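
For context, here is a rough sketch of what I have in mind, assuming the source is Databricks Auto Loader and that its `cloudFiles.includeExistingFiles` option is the right switch for this. The helper names (`autoloader_options`, `start_stream`) and the `schema`, `checkpoint` and `target` arguments are placeholders of mine, not part of our actual job, and the Delta sink is only an assumption.

```python
def autoloader_options(include_existing: bool = False) -> dict:
    """Build Auto Loader reader options for a JSON source.

    With include_existing=False, files already present in the input
    path when the stream first starts are meant to be skipped.
    """
    return {
        "cloudFiles.format": "json",
        # Spark reader options are passed as strings.
        "cloudFiles.includeExistingFiles": str(include_existing).lower(),
    }


def start_stream(spark, root, schema, checkpoint, target):
    """Start the stream (hypothetical wiring; needs a Databricks runtime).

    `spark` is an existing SparkSession, `root` is the directory that
    holds the <transaction_date>/<transaction_hour>/ folders, and
    `schema` is a StructType or DDL string describing the JSON files.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options())
        .schema(schema)
        .load(root)
        .writeStream.format("delta")  # assumption: Delta sink
        .option("checkpointLocation", checkpoint)
        .start(target)
    )
```

As far as I understand, this option is only evaluated the first time the stream starts with a fresh checkpoint, so changing it on an existing checkpoint would have no effect.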
