Databricks Community

JordanYaker · ‎06-04-2023

I have some Delta tables in our dev environment that started popping up with the following error today:

py4j.protocol.Py4JJavaError: An error occurred while calling o670.execute.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 104 in stage 1145.0 failed 4 times, most recent failure: Lost task 104.3 in stage 1145.0 (TID 2949) (10.111.21.215 executor 1): java.lang.IllegalStateException: Error reading streaming state file of HDFSStateStoreProvider[id = (op=0,part=104),dir = s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104]: s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104/1.delta does not exist. If the stream job is restarted with a new or updated state operation, please create a new checkpoint location or clear the existing checkpoint location.

These tables don't have an incredibly high write volume and just two weeks ago I ended up resetting the entire data lake in our dev/stage environment to deploy some new logic; which coincidentally corresponds to our current vacuum policy (i.e., 14 days).

This feels like less than a coincidence.

Is there a known issue with using vacuum on tables without a high write volume?

JordanYaker · ‎06-05-2023

@Kaniz Fatma

The file is indeed gone. Our permissions have not changed and everything is appropriate.
The checkpoint locations have not changed and are still accessible with the proper permissions as I mentioned in item 1.
Clearing the existing checkpoint locations is the only thing that works. This is not an acceptable long-term strategy however, because that means each of these pipelines will need to be re-processed and I'll be forever chasing my tail and deleting checkpoints with issues.
I'm already managing the checkpoint locations manually in S3.
I haven't manipulated the state provider configuration. It's all the default values.