<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Random failure in the loop in pyspark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/102408#M41097</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/89478"&gt;@MuthuLakshmi&lt;/a&gt;&amp;nbsp;thank you for your response. No, I don't use df.cache() anywhere in the code. Still, I tried uncaching the intermediate table that is read and updated within the loop, but it didn't help:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;spark.catalog.uncacheTable("IM_AWL_Products")&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;I don't want to disable cluster-level caching because an entire team runs their code on the same cluster, so I'd prefer to solve this within the code.&lt;/P&gt;</description>
    <pubDate>Tue, 17 Dec 2024 15:42:37 GMT</pubDate>
    <dc:creator>bcsalay</dc:creator>
    <dc:date>2024-12-17T15:42:37Z</dc:date>
    <item>
      <title>Random failure in the loop in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101582#M40732</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm encountering an issue in PySpark code that calculates certain information monthly in a loop. The flow is roughly:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Read the input and create/read intermediate Parquet files,&lt;/LI&gt;&lt;LI&gt;Upsert records in the intermediate Parquet files with the monthly information from the input,&lt;/LI&gt;&lt;LI&gt;Append records to an output Parquet file,&lt;/LI&gt;&lt;LI&gt;Proceed to the next month in the loop.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;The run fails at some random month about 75% of the time and succeeds the other 25%; I cannot figure out what causes the failures. When it succeeds, the run takes around 40 minutes and processes around 180 months.&lt;/P&gt;&lt;P&gt;Here's the error log:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Py4JJavaError: An error occurred while calling o71002.parquet.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 7856.0 failed 4 times, most recent failure: Lost task 1.3 in stage 7856.0 (TID 77757) (10.2.222.135 executor 12): org.apache.spark.SparkFileNotFoundException: Operation failed: "The specified path does not exist.", 404, GET, &lt;A href="https://xxx.dfs.core.windows.net/teamdata/Dev/Zones/Product/IM_AWL_Products/part-00001-tid-5889256371779762713-986e97fc-c887-4700-b842-6d4bd41c45d4-76340-1.c000.snappy.parquet?timeout=90" target="_blank" rel="noopener"&gt;https://xxx.dfs.core.windows.net/teamdata/Dev/Zones/Product/IM_AWL_Products/part-00001-tid-5889256371779762713-986e97fc-c887-4700-b842-6d4bd41c45d4-76340-1.c000.snappy.parquet?timeout=90&lt;/A&gt;, PathNotFound,&amp;nbsp; "The specified path does not exist. RequestId:ac6649bc-201f-000b-47f0-4a9296000000 Time:2024-12-10T10:47:46.7936012Z". [DEFAULT_FILE_NOT_FOUND] It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If disk cache is stale or the underlying files have been removed, you can invalidate disk cache manually by restarting the cluster. SQLSTATE: 42K03&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Tue, 10 Dec 2024 10:53:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101582#M40732</guid>
      <dc:creator>bcsalay</dc:creator>
      <dc:date>2024-12-10T10:53:09Z</dc:date>
    </item>
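The 404/PathNotFound above is the classic symptom of overwriting a Parquet directory that a lazily evaluated read still references: the plan is built against the old part files, the overwrite deletes them, and a later action fetches paths that no longer exist. One common code-level workaround is to never rewrite a directory in place, but to write the new version to a staging path and swap it in only once the write has finished. A minimal pure-Python analogue of that stage-and-swap pattern (the helper name and file layout are illustrative, not the poster's code):

```python
import os
import shutil
import tempfile

def stage_and_swap(target_dir, new_contents):
    """Write new_contents to a staging dir, then swap it in for target_dir.

    The old version of target_dir disappears only at the final rename, so a
    reader is never left pointing at a half-deleted directory mid-write.
    new_contents maps file names to file text.
    """
    parent = os.path.dirname(os.path.abspath(target_dir))
    staging = tempfile.mkdtemp(prefix="staging-", dir=parent)
    try:
        # 1. Materialize the new version completely, off to the side.
        for name, text in new_contents.items():
            with open(os.path.join(staging, name), "w") as f:
                f.write(text)
        # 2. Move the old version aside (if any), then swap in the new one.
        old = target_dir + ".old"
        if os.path.exists(target_dir):
            os.rename(target_dir, old)
        else:
            old = None
        os.rename(staging, target_dir)
        # 3. Only now is it safe to drop the old files.
        if old:
            shutil.rmtree(old)
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

In Spark terms the equivalent is writing each iteration's output to a fresh path (or materializing the read before overwriting) rather than overwriting a path the current DataFrame was read from.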
    <item>
      <title>Re: Random failure in the loop in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101586#M40734</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/135726"&gt;@bcsalay&lt;/a&gt;&amp;nbsp;This could possibly be a cache issue.&lt;BR /&gt;Are you calling dataframe.cache() anywhere in your code? If so, please follow it with unpersist().&lt;BR /&gt;&lt;BR /&gt;Also, can you try the following config at the cluster level:&lt;BR /&gt;&lt;SPAN&gt;spark.databricks.io.cache.enabled false&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Dec 2024 12:19:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101586#M40734</guid>
      <dc:creator>MuthuLakshmi</dc:creator>
      <dc:date>2024-12-10T12:19:30Z</dc:date>
    </item>
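The suggestion above targets two separate cache layers: DataFrame-level caching (cache()/unpersist()) and the Databricks disk cache, which keeps local copies of remote Parquet files and can serve stale data after the underlying files are replaced, which is consistent with the REFRESH TABLE hint in the error message. The staleness mechanism itself is easy to sketch in pure Python; this toy reader is an illustrative stand-in, not Spark internals:

```python
class CachedReader:
    """Toy analogue of a file cache that goes stale after an overwrite."""

    def __init__(self):
        self._cache = {}

    def read(self, path):
        # The first read populates the cache; later reads hit the cache even
        # if the file on disk has been replaced in the meantime.
        if path not in self._cache:
            with open(path) as f:
                self._cache[path] = f.read()
        return self._cache[path]

    def invalidate(self, path):
        # Analogue of REFRESH TABLE / spark.catalog.uncacheTable: drop the
        # cached copy so the next read goes back to storage.
        self._cache.pop(path, None)
```

Overwrite the file between two read() calls and the second call still returns the old contents until invalidate() is called, which is exactly the shape of the stale-cache hypothesis in this thread.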
    <item>
      <title>Re: Random failure in the loop in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101653#M40762</link>
      <description>&lt;P&gt;Can you show some code to give us the gist of what it does? Are the parquet files accessed as a catalog table? Could it be that some other job makes changes to the input tables?&lt;/P&gt;</description>
      <pubDate>Tue, 10 Dec 2024 19:27:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/101653#M40762</guid>
      <dc:creator>JacekLaskowski</dc:creator>
      <dc:date>2024-12-10T19:27:21Z</dc:date>
    </item>
    <item>
      <title>Re: Random failure in the loop in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/102408#M41097</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/89478"&gt;@MuthuLakshmi&lt;/a&gt;&amp;nbsp;thank you for your response. No, I don't use df.cache() anywhere in the code. Still, I tried uncaching the intermediate table that is read and updated within the loop, but it didn't help:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;spark.catalog.uncacheTable("IM_AWL_Products")&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;I don't want to disable cluster-level caching because an entire team runs their code on the same cluster, so I'd prefer to solve this within the code.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Dec 2024 15:42:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/102408#M41097</guid>
      <dc:creator>bcsalay</dc:creator>
      <dc:date>2024-12-17T15:42:37Z</dc:date>
    </item>
    <item>
      <title>Re: Random failure in the loop in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/102410#M41098</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/7083"&gt;@JacekLaskowski&lt;/a&gt;&amp;nbsp;thank you for your response. No, it is not a catalog table, and it is not accessed or used by another job. I tried to describe above what the code does operationally; to give some more context: it is development code that processes historical data to implement a certain business logic, and the outputs are used to define a flag during statistical modelling. It is not deployed anywhere yet and is not production code, just triggered manually by me or my team to create outputs.&lt;/P&gt;&lt;P&gt;I needed to write it with a loop because that will be more convenient once the code runs in production: the business logic looks backward indefinitely in history, so a flag created in 2015 can impact next month's decision. The code therefore aggregates all historical information into a row per product and reads/updates it each month.&lt;/P&gt;&lt;P&gt;Hope this gives more clarity.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Dec 2024 16:02:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/random-failure-in-the-loop-in-pyspark/m-p/102410#M41098</guid>
      <dc:creator>bcsalay</dc:creator>
      <dc:date>2024-12-17T16:02:35Z</dc:date>
    </item>
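Given that the intermediate state is both read and rewritten inside the monthly loop, one code-only restructuring (which avoids touching the shared cluster's cache config) is to version the state: each iteration reads the previous iteration's path and writes a new one, so no path is ever overwritten while a reader may still reference it. A minimal pure-Python sketch of that versioned-loop shape (the month list, merge rule, and file names are hypothetical, not the poster's code):

```python
import json
import os

def run_monthly_loop(base_dir, monthly_rows):
    """Upsert each month's rows into versioned state files, a row per product.

    Iteration k reads state-(k-1).json and writes state-k.json, so a file
    that may still be referenced is never rewritten in place. Returns the
    path of the final state file.
    """
    state = {}
    version = 0
    path = os.path.join(base_dir, f"state-{version}.json")
    with open(path, "w") as f:
        json.dump(state, f)
    for month, rows in monthly_rows:
        with open(path) as f:
            state = json.load(f)   # read the previous version
        state.update(rows)         # upsert this month's rows per product
        version = version + 1
        path = os.path.join(base_dir, f"state-{version}.json")
        with open(path, "w") as f:
            json.dump(state, f)    # write a new version, never in place
    return path
```

In the PySpark version of this pattern the state path would alternate or increment between iterations, with old versions cleaned up only after the loop (or after the next version is safely written).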
  </channel>
</rss>

