Adding deduplication method to spark streaming

patojo94 · ‎05-04-2022

Hi everyone, I am having some troubles to add a deduplication step on a file streaming that is already running. The code I am trying to add is this one:

df = df.withWatermark("arrival_time", "20 minutes")\
.dropDuplicates(["event_id", "arrival_time"])

However, I am getting the following error.

Caused by: java.lang.IllegalStateException: Error reading streaming state file of HDFSStateStoreProvider[id = (op=0,part=101),dir = dbfs:/mnt/checkpoints/silver_events/state/0/101]: dbfs:/mnt/checkpoints/silver_events/state/0/101/1.delta does not exist. If the stream job is restarted with a new or updated state operation, please create a new checkpoint location or clear the existing checkpoint location.

MI two questions are:

Why I am getting this error and what does it mean?
Is it really possible to delete a streamings checkpoint and not get duplicated data when restarting the streaming?

Thank you!