Autoloader: how to avoid overlap in files

-werners-
Esteemed Contributor III

I'm thinking of using Auto Loader to process files being put on our data lake.

Let's say, for example, that a parquet file is written every 15 minutes. These files, however, contain overlapping data.

Now, every 2 hours, I want to process the new data (with Auto Loader) and merge it into a Delta Lake table.

This seems pretty trivial, but unfortunately it is not:

When Auto Loader fetches the new data, the streaming query will contain duplicate data of two types: actual duplicates (which can be dropped with dropDuplicates), but also different versions of the same record (a record can be updated multiple times during a period of time). I want to process only the most recent version (based on a change date column).

For this last part, I don't see how I can fix this with a streaming query.

For batch, I would use a window function which partitions by the semantic key (id) and sorts on a timestamp.

But for streaming this is not possible; see the sketch below.
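
For reference, a minimal sketch of that batch approach; `id`, `change_date`, and the path are placeholders for the real key column, change-date column, and landing location:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Source data; path is a placeholder. In a Databricks notebook,
# `spark` is already defined.
df = spark.read.parquet("/mnt/landing/source")

# Keep only the most recent version per semantic key.
w = Window.partitionBy("id").orderBy(F.col("change_date").desc())

latest = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)

# Running the same row_number() over a streaming DataFrame raises an
# AnalysisException: non-time-based windows are not supported on
# streaming DataFrames.
```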

So, any ideas?

Basically it is the 'spark streaming keep most recent record in group' issue.

2 REPLIES

Hubert-Dudek
Esteemed Contributor III

What about foreachBatch and then MERGE?

Alternatively, run another process that will clean updates using the window function, as you said.
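
A minimal sketch of that pattern, assuming a Delta target table named `target` and the placeholder columns and paths from the question. Inside foreachBatch each micro-batch is a plain batch DataFrame, so the window-based dedupe from the question is allowed again:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def upsert_latest(batch_df, batch_id):
    # Ordinary batch DataFrame here, so row_number() works. This also
    # collapses multiple versions of a record within one micro-batch.
    w = Window.partitionBy("id").orderBy(F.col("change_date").desc())
    latest = (batch_df
              .withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") == 1)
              .drop("rn"))
    target = DeltaTable.forName(spark, "target")  # placeholder table name
    (target.alias("t")
           .merge(latest.alias("s"), "t.id = s.id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream
      .format("cloudFiles")                        # Auto Loader
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/target/schema")  # placeholder
      .load("/mnt/landing/source")                 # placeholder path
      .writeStream
      .foreachBatch(upsert_latest)
      .option("checkpointLocation", "/mnt/checkpoints/target")  # placeholder
      .trigger(availableNow=True)
      .start())
```

`.trigger(availableNow=True)` suits the every-2-hours schedule: each scheduled run drains the backlog and stops (on older runtimes, `.trigger(once=True)` is the equivalent).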

-werners-
Esteemed Contributor III

foreachBatch is an option, but then the merge will take a long time (a merge per file).

Also (I forgot to mention this): a single file can also contain multiple versions of a single record.

Not using Auto Loader seems to be the way to go at the moment, but it would be nice if it were possible after all without a lot of overhead.
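
That said, the merge doesn't necessarily have to run once per file: the micro-batch size is governed by the trigger and Auto Loader's rate-limit options, so one MERGE can cover all files that arrived since the last run. And because the row_number() dedupe inside foreachBatch runs over the whole micro-batch, multiple versions inside a single file are collapsed by the same pass. A sketch of the read side only, with an illustrative value:

```python
# Cap how many files fall into one micro-batch, so a single MERGE
# covers many files rather than one. Value and paths are illustrative.
stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.maxFilesPerTrigger", "1000")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/target/schema")  # placeholder
    .load("/mnt/landing/source"))  # placeholder path
```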
