Hi everyone, I am having some troubles to add a deduplication step on a file streaming that is already running. The code I am trying to add is this one:
df = df.withWatermark("arrival_time", "20 minutes")\
.dropDuplicates(["event_id", "arrival_time"])
However, I am getting the following error.
Caused by: java.lang.IllegalStateException: Error reading streaming state file of HDFSStateStoreProvider[id = (op=0,part=101),dir = dbfs:/mnt/checkpoints/silver_events/state/0/101]: dbfs:/mnt/checkpoints/silver_events/state/0/101/1.delta does not exist. If the stream job is restarted with a new or updated state operation, please create a new checkpoint location or clear the existing checkpoint location.
MI two questions are:
- Why I am getting this error and what does it mean?
- Is it really possible to delete a streamings checkpoint and not get duplicated data when restarting the streaming?
Thank you!