I'm running into an issue in a PySpark job that calculates certain information month by month in a loop. The flow is roughly:
- Read input and create/read intermediate parquet files
- Upsert records in intermediate parquet files with the monthly information from input
- Append records to an output parquet file
- Proceed to next month within the loop
The job fails at a seemingly random month about 75% of the time and succeeds the other 25%. I cannot figure out what causes the failures. A successful run takes around 40 minutes and processes around 180 months.
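For reference, the loop has roughly the shape sketched below. This is only an illustration of the flow, not the actual code: `month_starts`, `run_monthly_loop`, `upsert_step`, the `month` column, the paths, and the start date are all placeholders. Only `spark.read.parquet`, `DataFrame.where`, and `DataFrameWriter.mode("append").parquet` are real PySpark calls.

```python
from datetime import date


def month_starts(first, count):
    """Yield the first day of `count` consecutive months, starting at `first`."""
    year, month = first.year, first.month
    for _ in range(count):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def run_monthly_loop(spark, months, input_path, output_path, upsert_step):
    """Run the read -> upsert -> append flow once per month.

    `spark` is an existing SparkSession. `upsert_step` is a hypothetical
    callable standing in for the intermediate-parquet upsert and must
    return the DataFrame to append for that month.
    """
    for month in months:
        # Read the input and keep one month (the `month` column is assumed).
        monthly = spark.read.parquet(input_path).where(
            f"month = '{month.isoformat()}'"
        )
        # Upsert the monthly records into the intermediate parquet files.
        result = upsert_step(spark, monthly, month)
        # Append this month's records to the output parquet file.
        result.write.mode("append").parquet(output_path)
```

A full run would then be something like `run_monthly_loop(spark, month_starts(date(2010, 1, 1), 180), input_path, output_path, upsert_step)`, where the start date is made up and 180 is the approximate number of months processed.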
Here's the error log:
Py4JJavaError: An error occurred while calling o71002.parquet.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 7856.0 failed 4 times, most recent failure: Lost task 1.3 in stage 7856.0 (TID 77757) ( executor 12): org.apache.spark.SparkFileNotFoundException: Operation failed: "The specified path does not exist.", 404, GET, https://xxx.dfs.core.windows.net/teamdata/Dev/Zones/Product/IM_AWL_Products/part-00001-tid-588925637..., PathNotFound, "The specified path does not exist. RequestId:ac6649bc-201f-000b-47f0-4a9296000000 Time:2024-12-10T10:47:46.7936012Z". [DEFAULT_FILE_NOT_FOUND] It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If disk cache is stale or the underlying files have been removed, you can invalidate disk cache manually by restarting the cluster. SQLSTATE: 42K03