10-19-2021 01:44 AM
Hi, we are having this chain of errors every day in different files and processes:
An error occurred while calling o11255.parquet.
: org.apache.spark.SparkException: Job aborted.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 982.0 failed 4 times, most recent failure: Lost task 0.3 in stage 982.0 (TID 85705, 172.20.45.5, executor 31): org.apache.spark.SparkException: Task failed while writing rows.
Caused by: com.databricks.sql.io.FileReadException: Error while reading file dbfs: ... It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Caused by: shaded.parquet.org.apache.thrift.transport.TTransportException: java.io.IOException: Stream is closed!
Caused by: java.io.IOException: Stream is closed!
Caused by: java.io.FileNotFoundException: dbfs:/...
Right now we fix it by deleting the file and re-running the job, but we don't know how to avoid the error in the first place.
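For reference, a minimal sketch of that manual workaround; the path below is a placeholder, not the real truncated dbfs:/... path from the trace:

```python
# Hypothetical path of the file the job complains about; the real one comes from the stack trace.
bad_file = "dbfs:/mnt/some-container/some-table/part-00000-xxxx.snappy.parquet"

# Remove the stale file reference, then re-run the job.
dbutils.fs.rm(bad_file)
```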
Any idea?
Thxs
10-19-2021 03:36 AM
Hi @Kaniz Fatma, nice to meet you too!
I've been looking for a fix to this problem for many days, and I found some similar questions in different forums, including the Databricks one, but without any real solution.
For that reason I've created this question, hoping to solve it ASAP.
Thxs
10-19-2021 08:16 AM
Can you elaborate a bit on the environment?
Is it a streaming job or a batch job? Where do you write to: S3, ADLS, ...?
Do you mount/unmount storage, etc.?
10-19-2021 09:19 AM
What's happening here is that Spark builds a list of parquet file names it wants to pull data from. Then, when it goes to read one of those parquet files, it notices that the file does not actually exist in storage anymore, so it throws this error.
Usually this is caused by some other process updating/deleting the files in this location while the read is taking place. I would look to see what else could be touching this location at the same time.
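If the file listing is merely stale (for example, the same job rewrote the data earlier on), the cache invalidation that the error message suggests can be done like this; the table name and path below are placeholders:

```python
# Refresh the cached file listing for a registered table
# (equivalent to SQL: REFRESH TABLE my_table).
spark.catalog.refreshTable("my_table")

# Or refresh everything Spark has cached for a storage path.
spark.catalog.refreshByPath("dbfs:/mnt/some-container/some-table")
```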
10-19-2021 09:43 AM
Hi @Jose Eliseo Aznarte Garcia,
Like @Dan Zafar said, this is happening due to file updates/changes during your job execution. Do you delete data manually, or drop and recreate tables in the same place? I would highly recommend using Delta instead. By using Delta, you will avoid this error.
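For illustration, a minimal sketch of moving a parquet write/read over to Delta; the path is a placeholder and the real job layout will differ:

```python
df = spark.range(10)  # placeholder DataFrame standing in for the job's output

# Write with the Delta format instead of plain parquet.
(df.write
   .format("delta")
   .mode("overwrite")
   .save("dbfs:/mnt/some-container/some-table"))

# Read it back through the Delta reader, which resolves files from the transaction log
# instead of a directory listing, so stale/missing part files no longer break the read.
df2 = spark.read.format("delta").load("dbfs:/mnt/some-container/some-table")
```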
10-19-2021 09:45 AM
+1 to Delta!
10-20-2021 02:10 AM
Thxs for your answers
About the environment: we are running batch jobs on Databricks Runtime Version 6.4, with Apache Spark 2.4.5, and our code is written in Python 3.7.6.
Today we realized that all our errors are taking place in the same storage account, but in different files and different jobs, as I told you before.
Is it possible that the error could be caused by an overload of the storage?
I also found a file "_commited_vacuum" in the parquet directory which causes an error. What does it mean?
10-20-2021 06:01 AM
Vacuum means that Delta was removing files. It's important not to read Delta-backed parquet files with the plain parquet reader, as it will cause version problems. Are the tables backed by Delta?
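A quick, hedged way to check, using the Delta Lake Python API that ships with Databricks runtimes; the path is a placeholder:

```python
from delta.tables import DeltaTable

path = "dbfs:/mnt/some-container/some-table"  # placeholder

if DeltaTable.isDeltaTable(spark, path):
    # Resolve files through the Delta transaction log rather than a directory listing.
    df = spark.read.format("delta").load(path)
else:
    df = spark.read.parquet(path)
```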
A side note: it's important to update to 3.2 as soon as possible. AQE, which arrived in the 3.0 release, is going to fix a lot of bugs and speed up your queries too.
10-20-2021 06:31 AM
It is not necessarily a Delta table, as you can also vacuum 'plain' Spark tables:
https://docs.databricks.com/spark/latest/spark-sql/dbio-commit.html#vacuum-spark
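For reference, a hedged sketch of manually vacuuming a non-Delta directory written with the DBIO commit protocol, along the lines of the doc above; the path and retention window are placeholders:

```python
# Clean up uncommitted files older than the retention window in a plain parquet directory.
spark.sql("VACUUM 'dbfs:/mnt/some-container/some-table' RETAIN 168 HOURS")
```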
10-20-2021 02:21 AM
Vacuum is the cleaning up of uncommitted files. This happens automatically in Databricks, but you can also trigger it manually.
My guess is that you have multiple jobs updating/deleting files in a parquet directory.
(As Dan and Jose mentioned.)
Can you check this?
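One hedged way to check what is sitting in a failing directory, including commit markers like the _commited_vacuum file mentioned above; the path is a placeholder:

```python
# List the parquet directory on DBFS and print any DBIO commit markers
# (_started_*, _committed_*) alongside the data files.
path = "dbfs:/mnt/some-container/some-table"  # placeholder

for f in dbutils.fs.ls(path):
    if f.name.startswith("_"):
        print(f.name, f.size)
```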
10-25-2021 09:21 AM
Hi all
We moved one of the processes to use the storage of a different Azure account a few days ago, and the error that I reported has not happened again.
I don't think it was a coincidence, so I conclude that the problem was related to some overload in the storage, because I'm sure our processes don't read and write the same file at the same time.
02-23-2022 02:53 AM
Hi all,
I am also looking for a resolution of the same error. We are using DBR "9.1 LTS ML (includes Apache Spark 3.1.2, Scala 2.12)" and getting this error. We are reading and writing data from the same path, but there are partitions inside the folder to differentiate the paths. Is there any solution to this error?
04-11-2022 05:12 AM
Hi @Kaniz Fatma, here I am sharing the error log.