Has anyone else seen state files disappear in low-volume delta tables?

JordanYaker
Contributor

I have some Delta tables in our dev environment that started popping up with the following error today:

py4j.protocol.Py4JJavaError: An error occurred while calling o670.execute.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 104 in stage 1145.0 failed 4 times, most recent failure: Lost task 104.3 in stage 1145.0 (TID 2949) (10.111.21.215 executor 1): java.lang.IllegalStateException: Error reading streaming state file of HDFSStateStoreProvider[id = (op=0,part=104),dir = s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104]: s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104/1.delta does not exist. If the stream job is restarted with a new or updated state operation, please create a new checkpoint location or clear the existing checkpoint location.

These tables don't have an incredibly high write volume, and just two weeks ago I reset the entire data lake in our dev/stage environment to deploy some new logic. That timing coincidentally matches our current vacuum policy (i.e., 14 days).

This feels like less than a coincidence.

Is there a known issue with using vacuum on tables without a high write volume?
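For triage across many failing tasks, the error above can be broken into its parts (operator, partition, state directory, missing file). The following is a minimal sketch that assumes the message wording quoted in this post; other Spark versions may phrase it differently. The helper name `parse_missing_state_file` and the bucket `example-bucket` are hypothetical.

```python
import re

# Pattern derived from the message format quoted in this thread (an assumption;
# it is not a stable Spark API and may change between versions).
_STATE_ERR = re.compile(
    r"HDFSStateStoreProvider\[id = \(op=(?P<op>\d+),part=(?P<part>\d+)\),"
    r"\s*dir = (?P<dir>[^\]]+)\]: (?P<file>\S+) does not exist"
)


def parse_missing_state_file(message: str):
    """Return details of the missing state file, or None if nothing matches."""
    m = _STATE_ERR.search(message)
    if m is None:
        return None
    # The last path segment looks like "<version>.delta" or "<version>.snapshot".
    file_name = m.group("file").rsplit("/", 1)[-1]
    version, _, kind = file_name.partition(".")
    return {
        "operator": int(m.group("op")),
        "partition": int(m.group("part")),
        "state_dir": m.group("dir"),
        "missing_file": m.group("file"),
        "version": int(version) if version.isdigit() else None,
        "kind": kind,
    }


# Hypothetical message with a placeholder bucket name.
sample = (
    "java.lang.IllegalStateException: Error reading streaming state file of "
    "HDFSStateStoreProvider[id = (op=0,part=104),dir = "
    "s3a://example-bucket/checkpoints/pipeline/state/0/104]: "
    "s3a://example-bucket/checkpoints/pipeline/state/0/104/1.delta does not exist."
)
info = parse_missing_state_file(sample)
```

Grouping the parsed results by `state_dir` and `version` would show whether only the oldest versions are affected, which is what an age-based cleanup would produce.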

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz
Community Manager

Hi @Jordan Yaker​, The error message suggests an issue with the streaming state file in the HDFSStateStoreProvider. Specifically, it mentions that the file s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104/1.delta does not exist. This could be due to the file being missing or not accessible.

To troubleshoot and resolve this issue, you can try the following steps:

  1. Verify the file path: Double-check that the file path s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104/1.delta is correct and accessible. Ensure that the necessary permissions are in place to access the file in your S3 bucket.
  2. Check checkpoint location: Ensure that the location specified in your streaming job is correct and accessible. If the checkpoint location has been modified or relocated, you may need to update the configuration accordingly.
  3. Clear existing checkpoint location: If the stream job is being restarted with a new or updated state operation, you might need to clear the existing checkpoint location. This can be done by manually deleting the checkpoint files or changing the checkpoint location to a new directory.
  4. Create a new checkpoint location: If you have cleared the existing one, you can create a new checkpoint directory and specify it in your streaming job configuration.
  5. Review stream job configuration: Double-check the configuration settings for your streaming job, including the state store provider and checkpoint location. Ensure that all configurations are correctly set up and match the intended behavior.

If the issue persists after attempting these steps, please provide more information about your specific streaming job, the Spark version you are using, and any relevant stack traces or error logs. This additional context will help with further troubleshooting and more specific guidance.
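When verifying the state directory, it can help to check which files a given state version actually depends on. The sketch below assumes the default HDFS-backed state store layout, where each partition directory holds `<version>.delta` and `<version>.snapshot` files and a version is rebuilt from the newest snapshot at or below it plus every later delta. The function name `missing_files_for_version` is hypothetical, and the file listing is expected to come from whatever S3 listing tool you already use.

```python
import re

_STATE_FILE = re.compile(r"^(\d+)\.(delta|snapshot)$")


def missing_files_for_version(file_names, version):
    """Return the delta files needed to load `version` that are absent.

    Assumption: state is rebuilt from the newest snapshot <= version
    (or from empty state if there is none) plus all later deltas.
    """
    deltas, snapshots = set(), set()
    for name in file_names:
        m = _STATE_FILE.match(name)
        if m:
            target = deltas if m.group(2) == "delta" else snapshots
            target.add(int(m.group(1)))
    # With no usable snapshot, replay starts from version 0 (empty state).
    base = max((s for s in snapshots if s <= version), default=0)
    return [f"{v}.delta" for v in range(base + 1, version + 1) if v not in deltas]


# Hypothetical listings of one state partition directory.
no_snapshot = ["2.delta", "3.delta"]
with_snapshot = ["2.snapshot", "3.delta"]
gap_without_snapshot = missing_files_for_version(no_snapshot, 3)
gap_with_snapshot = missing_files_for_version(with_snapshot, 3)
```

If old delta files are removed before a snapshot covers them, the first listing is the result: later versions can no longer be rebuilt even though the most recent files are still present.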


5 REPLIES

@Kaniz Fatma​

  1. The file is indeed gone. Our permissions have not changed and everything is appropriate.
  2. The checkpoint locations have not changed and are still accessible with the proper permissions as I mentioned in item 1.
  3. Clearing the existing checkpoint locations is the only thing that works. However, this is not an acceptable long-term strategy, because each of these pipelines would need to be reprocessed and I'd be forever chasing my tail, deleting checkpoints with issues.
  4. I'm already managing the checkpoint locations manually in S3.
  5. I haven't manipulated the state provider configuration. It's all the default values.

@Kaniz Fatma​ I'm using DBR 11.3, which means PySpark 3.3.0.

Additionally, the full stack trace that I'm getting is attached to this reply.

Kaniz
Community Manager

Hi @Jordan Yaker​ ,

Verify File Retention: Ensure that the retention policy for your Delta tables is properly set. Check if there are any external processes or scripts that might be unintentionally removing the state files. Confirm that the retention period aligns with your requirements and that it is not causing the premature deletion of necessary files.
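One concrete thing to rule out is whether the checkpoint directory sits inside a Delta table's own directory. As far as I understand the Delta Lake documentation, VACUUM removes unreferenced files under the table root that are older than the retention period and skips directories whose names begin with an underscore (such as `_delta_log`). The sketch below only checks path containment. The function name and the example table root are assumptions, since the thread does not say where the table root is.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def checkpoint_exposed_to_vacuum(table_root: str, checkpoint: str) -> bool:
    """True if `checkpoint` lies under `table_root` with no '_'-prefixed directory.

    Assumption: VACUUM only considers files under the table root and skips
    directories starting with an underscore.
    """
    t, c = urlparse(table_root), urlparse(checkpoint)
    # Different scheme or bucket: VACUUM on this table cannot reach the path.
    if (t.scheme, t.netloc) != (c.scheme, c.netloc):
        return False
    try:
        rel = PurePosixPath(c.path).relative_to(PurePosixPath(t.path))
    except ValueError:
        return False  # checkpoint is not inside the table directory
    return not any(part.startswith("_") for part in rel.parts)


# Hypothetical paths; "example-bucket" and the table root are placeholders.
root = "s3a://example-bucket/offers-stage-1"
exposed = checkpoint_exposed_to_vacuum(root, root + "/checkpoints/pipeline")
hidden = checkpoint_exposed_to_vacuum(root, root + "/_checkpoints/pipeline")
```

If this returns True for a real table and checkpoint pair, moving the checkpoint outside the table directory, or under an underscore-prefixed directory, would be worth testing before changing the retention period.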

Check this wonderful thread on how to set the retention policy

Please get in touch with the Delta Lake community or support channels like the Delta Lake GitHub repository or forums to report the issue and seek further assistance. Provide detailed information about your setup, including Delta Lake version, Spark version, relevant configurations, and any relevant error logs. The community members and maintainers can provide insights and guidance specific to your setup.

Don't forget to share your findings, including troubleshooting steps and outcomes, with the relevant stakeholders and the Delta Lake community. This information will be valuable for ongoing investigations and potential resolutions.

Please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.

Anonymous
Not applicable

Hi @Jordan Yaker​ 

We haven't heard from you since the last response from @Kaniz Fatma​ , and I was checking back to see if her suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.
