Has anyone else seen state files disappear in low-volume delta tables?

JordanYaker
Contributor

Some Delta tables in our dev environment started throwing the following error today:

py4j.protocol.Py4JJavaError: An error occurred while calling o670.execute.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 104 in stage 1145.0 failed 4 times, most recent failure: Lost task 104.3 in stage 1145.0 (TID 2949) (10.111.21.215 executor 1): java.lang.IllegalStateException: Error reading streaming state file of HDFSStateStoreProvider[id = (op=0,part=104),dir = s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104]: s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104/1.delta does not exist. If the stream job is restarted with a new or updated state operation, please create a new checkpoint location or clear the existing checkpoint location.

These tables don't have an incredibly high write volume, and just two weeks ago I reset the entire data lake in our dev/stage environment to deploy some new logic. That timing lines up exactly with our current vacuum retention policy (i.e., 14 days).

This feels like more than a coincidence.

Is there a known issue with using vacuum on tables without a high write volume?
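
For reference, this is roughly the check I've been running to see whether table-level retention could explain it. The table path below is a placeholder, and I'm not claiming this is the root cause; it just rules retention in or out:

from delta.tables import DeltaTable

# Placeholder path -- substitute the real table location.
table_path = "s3a://my-bucket/offers-stage-1/tables/offers_silver"

# Inspect the table's effective retention-related properties.
spark.sql(f"SHOW TBLPROPERTIES delta.`{table_path}`").show(truncate=False)

# As a test, push deleted-file retention well past the 14-day vacuum window.
spark.sql(f"""
    ALTER TABLE delta.`{table_path}`
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days')
""")

# Vacuum with an explicit 30-day horizon (720 hours) instead of the default.
DeltaTable.forPath(spark, table_path).vacuum(720)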

3 REPLIES

@Kaniz Fatma

  1. The file is indeed gone. Our permissions have not changed and are still configured correctly.
  2. The checkpoint locations have not changed and are still accessible with the proper permissions as I mentioned in item 1.
  3. Clearing the existing checkpoint locations is the only thing that works. However, this is not an acceptable long-term strategy, because it means each of these pipelines will need to be reprocessed and I'll be forever chasing my tail, deleting checkpoints as issues crop up.
  4. I'm already managing the checkpoint locations manually in S3.
  5. I haven't manipulated the state provider configuration; it's all the default values (see the check below).
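
For completeness, this is how I confirmed we're on the defaults. The fallback values passed as the second argument are just Spark 3.3's documented defaults, not anything we've set ourselves:

# Print the state store settings; the second argument is the documented default.
print(spark.conf.get(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider"))
print(spark.conf.get("spark.sql.streaming.minBatchesToRetain", "100"))
print(spark.conf.get("spark.sql.streaming.stateStore.minDeltasForSnapshot", "10"))
print(spark.conf.get("spark.sql.streaming.stateStore.maintenanceInterval", "60s"))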

@Kaniz Fatma I'm using DBR 11.3, which means PySpark 3.3.0.

Additionally, the full stack trace that I'm getting is attached to this reply.

Anonymous
Not applicable

Hi @Jordan Yaker

We haven't heard from you since the last response from @Kaniz Fatma, and I was checking back to see if her suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
