Has anyone else seen state files disappear in low-volume delta tables?
06-04-2023 01:16 PM
I have some Delta tables in our dev environment whose pipelines started throwing the following error today:
py4j.protocol.Py4JJavaError: An error occurred while calling o670.execute.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 104 in stage 1145.0 failed 4 times, most recent failure: Lost task 104.3 in stage 1145.0 (TID 2949) (10.111.21.215 executor 1): java.lang.IllegalStateException: Error reading streaming state file of HDFSStateStoreProvider[id = (op=0,part=104),dir = s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104]: s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104/1.delta does not exist. If the stream job is restarted with a new or updated state operation, please create a new checkpoint location or clear the existing checkpoint location.
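For context, state files like the one in that error are written by a stateful streaming query under <checkpointLocation>/state/<operator>/<partition>/<version>.delta. Here's a minimal sketch of the kind of pipeline that produces them; the bucket, paths, and the dropDuplicates step are placeholders, not our actual pipeline logic:
```python
# Minimal sketch of a stateful streaming write that produces state files like
# .../state/0/104/1.delta under its checkpoint location.
# All paths, source/sink locations, and the dedup key are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(
    spark.readStream
    .format("delta")
    .load("s3a://example-bucket/offers-stage-1/bronze")  # hypothetical source
    .dropDuplicates(["offer_id"])  # stateful operator -> HDFSStateStoreProvider files
    .writeStream
    .format("delta")
    .option(
        "checkpointLocation",
        "s3a://example-bucket/offers-stage-1/checkpoints/offers-silver-stage1-pipeline",
    )
    .outputMode("append")
    .start("s3a://example-bucket/offers-stage-1/silver")  # hypothetical sink
)
```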
These tables don't have a particularly high write volume, and about two weeks ago I reset the entire data lake in our dev/stage environment to deploy some new logic, which lines up exactly with our current vacuum retention policy (14 days).
This feels like more than a coincidence.
Is there a known issue with using vacuum on tables without a high write volume?
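For reference, here's roughly how I've been sanity-checking the retention side of this from a notebook; the table name and paths below are placeholders:
```python
# Sanity checks in a Databricks notebook (table name and paths are placeholders).

# 1. See whether delta.deletedFileRetentionDuration / delta.logRetentionDuration
#    have been overridden on the table the stream writes to.
spark.sql("SHOW TBLPROPERTIES dev.offers_silver").show(truncate=False)

# 2. List what actually remains in the affected state partition directory.
for f in dbutils.fs.ls(
    "s3a://example-bucket/offers-stage-1/checkpoints/"
    "offers-silver-stage1-pipeline/state/0/104/"
):
    print(f.path, f.size)
```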
Labels: Delta, VACUUM Command
06-05-2023 12:06 PM
@Kaniz Fatma
- The file is indeed gone. Our permissions have not changed and are still configured correctly.
- The checkpoint locations have not changed and are still accessible with the proper permissions as I mentioned in item 1.
- Clearing the existing checkpoint locations is the only thing that works (see the sketch after this list). It's not an acceptable long-term strategy, however, because each of those pipelines then has to be re-processed, and I'd be forever chasing my tail deleting broken checkpoints.
- I'm already managing the checkpoint locations manually in S3.
- I haven't manipulated the state provider configuration. It's all the default values.
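This is roughly what that manual checkpoint handling looks like; the path is a placeholder and `dbutils.fs` is the Databricks notebook file-system utility:
```python
# Rough sketch of the manual checkpoint handling described above (placeholder path).
checkpoint_root = (
    "s3a://example-bucket/offers-stage-1/checkpoints/offers-silver-stage1-pipeline"
)

# Confirm whether the state file from the error is really gone.
dbutils.fs.ls(checkpoint_root + "/state/0/104/")

# Last-resort workaround suggested by the error message: clear the checkpoint so
# the stream starts fresh. This forces a full re-process, which is why it isn't
# a sustainable fix.
dbutils.fs.rm(checkpoint_root, recurse=True)
```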
06-05-2023 12:10 PM
@Kaniz Fatma I'm using DBR 11.3, which means PySpark 3.3.0.
Additionally, the full stack trace that I'm getting is attached to this reply.
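For completeness, this is a quick way to confirm those versions from a notebook; the DATABRICKS_RUNTIME_VERSION environment variable is set by Databricks, so treat it as an assumption outside that environment:
```python
# Confirm the PySpark and Databricks Runtime versions mentioned above.
import os

print(spark.version)                                  # expected: 3.3.0 on DBR 11.3
print(os.environ.get("DATABRICKS_RUNTIME_VERSION"))   # expected: 11.3 on DBR 11.3
```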

06-09-2023 07:32 PM
Hi @Jordan Yaker,
We haven't heard from you since @Kaniz Fatma's last response, and I wanted to check whether her suggestions helped.
If you found a solution, please share it with the community so it can help others.
Also, please don't forget to click the "Select As Best" button whenever a reply resolves your question.

