06-04-2023 01:16 PM
I have some Delta tables in our dev environment that started popping up with the following error today:
py4j.protocol.Py4JJavaError: An error occurred while calling o670.execute.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 104 in stage 1145.0 failed 4 times, most recent failure: Lost task 104.3 in stage 1145.0 (TID 2949) (10.111.21.215 executor 1): java.lang.IllegalStateException: Error reading streaming state file of HDFSStateStoreProvider[id = (op=0,part=104),dir = s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104]: s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104/1.delta does not exist. If the stream job is restarted with a new or updated state operation, please create a new checkpoint location or clear the existing checkpoint location.
These tables don't have an especially high write volume, and just two weeks ago I reset the entire data lake in our dev/stage environment to deploy some new logic. That was exactly 14 days ago, which matches our current vacuum retention policy.
This feels like more than a coincidence.
Is there a known issue with using vacuum on tables without a high write volume?
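One way to sanity-check the vacuum theory is to verify whether the streaming checkpoint lives inside the Delta table's own directory. VACUUM can delete files it does not recognize under the table root once they age past the retention window, so a nested checkpoint can lose its state files. The helpers below are a minimal pure-Python sketch (the function names are mine, not part of any Databricks API):

```python
from datetime import timedelta

def vacuum_retain_hours(days: int) -> int:
    """Convert a retention policy in days to the HOURS figure VACUUM expects."""
    return int(timedelta(days=days).total_seconds() // 3600)

def checkpoint_inside_table(table_path: str, checkpoint_path: str) -> bool:
    """True if the streaming checkpoint is nested under the Delta table root.

    A nested checkpoint is risky because VACUUM removes unrecognized files
    in the table directory (e.g. .../state/0/104/1.delta) once they are
    older than the retention window.
    """
    table_root = table_path.rstrip("/") + "/"
    return (checkpoint_path.rstrip("/") + "/").startswith(table_root)
```

For example, a 14-day policy corresponds to `VACUUM ... RETAIN 336 HOURS`, and a checkpoint at `<table>/checkpoints/...` would be flagged as nested.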
06-05-2023 07:58 AM
Hi @Jordan Yaker, the error message points to a problem with the streaming state file in the HDFSStateStoreProvider. Specifically, it reports that the file
s3a://###################/offers-stage-1/checkpoints/offers-silver-stage1-pipeline/state/0/104/1.delta does not exist, meaning the file is missing or inaccessible.
To troubleshoot and resolve this, you can try the steps the error message itself suggests: restart the stream with a new checkpoint location, or clear the existing checkpoint location.
If the issue persists after these steps, please share more detail about your streaming job, the Spark version you are using, and any relevant stack traces or error logs. That additional context will help with further troubleshooting and more specific guidance.
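Pointing the restarted stream at a fresh checkpoint location just means passing any new, empty path as the `checkpointLocation` option of the stream's writer. One hedged convention for generating that path (the `-vN` suffix scheme is purely illustrative, not anything Spark requires) is:

```python
import re

def next_checkpoint_path(path: str) -> str:
    """Return a fresh checkpoint path by bumping a trailing -vN suffix.

    Hypothetical naming convention for illustration only; any new, empty
    location works as far as Structured Streaming is concerned.
    """
    m = re.search(r"-v(\d+)$", path.rstrip("/"))
    if m:
        return path[: m.start()] + f"-v{int(m.group(1)) + 1}"
    return path.rstrip("/") + "-v2"
```

The resulting path would then be supplied via `.option("checkpointLocation", next_checkpoint_path(old_path))` on the `writeStream` builder. Note that a fresh checkpoint discards accumulated state, so stateful operators start over.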
06-05-2023 12:06 PM
@Kaniz Fatma
06-05-2023 12:10 PM
@Kaniz Fatma I'm using DBR 11.3, which means PySpark 3.3.0.
Additionally, the full stack trace that I'm getting is attached to this reply.
06-06-2023 05:50 AM
Hi @Jordan Yaker,
Verify File Retention: Ensure that the retention policy for your Delta tables is properly set. Check if there are any external processes or scripts that might be unintentionally removing the state files. Confirm that the retention period aligns with your requirements and that it is not causing the premature deletion of necessary files.
Check this wonderful thread on how to set the retention policy.
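If premature file deletion is the cause, one documented lever is the table property `delta.deletedFileRetentionDuration`, which widens the window before VACUUM may remove tombstoned files. A minimal sketch that builds the corresponding SQL statement (the table name and 30-day value are placeholders; you would run the result via `spark.sql(...)`):

```python
def retention_tblproperties_sql(table: str, days: int) -> str:
    """Build an ALTER TABLE statement that widens Delta's file-retention
    window via the documented delta.deletedFileRetentionDuration property."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.deletedFileRetentionDuration' = 'interval {days} days')"
    )
```

Note this property governs Delta data files referenced by the transaction log; streaming state files under a checkpoint directory are managed separately, so keeping checkpoints outside the table path matters regardless of this setting.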
Please get in touch with the Delta Lake community or support channels, such as the Delta Lake GitHub repository or forums, to report the issue and seek further assistance. Provide detailed information about your setup, including the Delta Lake version, Spark version, relevant configurations, and any error logs. The community members and maintainers can then offer insights and guidance specific to your setup.
Don't forget to share your findings, including troubleshooting steps and outcomes, with the relevant stakeholders and the Delta Lake community. This information will be valuable for ongoing investigations and potential resolutions.
Please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.
06-09-2023 07:32 PM
Hi @Jordan Yaker,
We haven't heard from you since the last response from @Kaniz Fatma, and I was checking back to see if her suggestions helped you.
Otherwise, if you have found a solution, please share it with the community, as it may help others.
Also, Please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.