I'm thinking of using Auto Loader to process files that land on our data lake.
Let's say a Parquet file is written, e.g., every 15 minutes. These files, however, contain overlapping data.
Now, every 2 hours I want to process the new data (via Auto Loader) and merge it into a Delta Lake table.
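Roughly the skeleton I have in mind so far (paths, checkpoint location, table name and the availableNow trigger are placeholders, not a working setup):

```python
# Skeleton only; paths, checkpoint location and table name are made up.
new_data = (
    spark.readStream
         .format("cloudFiles")                   # Auto Loader
         .option("cloudFiles.format", "parquet")
         .load("/mnt/datalake/incoming/")
)

# Scheduled every 2 hours: pick up only the files added since the last run.
# Instead of the plain append below, the new data should be merged into the
# target Delta table.
(new_data.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/target_table")
         .trigger(availableNow=True)             # process the backlog, then stop
         .toTable("target_table"))
```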
This seems pretty trivial, but unfortunately it is not:
When Auto Loader fetches the new data, the streaming query will contain duplicate data of two types: exact duplicates (which can be dropped with dropDuplicates), but also different versions of the same record (a record can be updated multiple times within the interval). I only want to process the most recent version of each record (based on a change date column).
For this last part, I don't see how to solve it with a streaming query.
For batch, I would use a window function that partitions by the semantic key (id) and sorts on a timestamp, something like the snippet below.
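Something like this (id and change_date are placeholder column names, and the static read is just for illustration):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Batch version: read the same files statically and keep the latest version per key.
df = spark.read.parquet("/mnt/datalake/incoming/")

w = Window.partitionBy("id").orderBy(F.col("change_date").desc())

latest = (
    df.dropDuplicates()                          # drop the exact duplicates
      .withColumn("rn", F.row_number().over(w))  # rank versions per key, newest first
      .filter("rn = 1")                          # keep only the most recent version
      .drop("rn")
)
```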
But for streaming this is not possible.
So, any ideas?
Basically it is the 'spark streaming keep most recent record in group' issue.