10-19-2021 01:44 AM
Hi, we are having this chain of errors every day in different files and processes:
An error occurred while calling o11255.parquet.
: org.apache.spark.SparkException: Job aborted.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 982.0 failed 4 times, most recent failure: Lost task 0.3 in stage 982.0 (TID 85705, 172.20.45.5, executor 31): org.apache.spark.SparkException: Task failed while writing rows.
Caused by: com.databricks.sql.io.FileReadException: Error while reading file dbfs: ... It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
Caused by: shaded.parquet.org.apache.thrift.transport.TTransportException: java.io.IOException: Stream is closed!
Caused by: java.io.IOException: Stream is closed!
Caused by: java.io.FileNotFoundException: dbfs:/...
Right now we fix it by deleting the file and re-running the job, but we don't know how to avoid the error in the first place.
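For reference, a minimal sketch of that manual workaround; the path below is a placeholder, not the real truncated dbfs:/... path from the trace:

```python
# Hypothetical path of the file the job complains about; the real one comes from the stack trace.
bad_file = "dbfs:/mnt/some-container/some-table/part-00000-xxxx.snappy.parquet"

# Remove the stale file reference, then re-run the job.
dbutils.fs.rm(bad_file)
```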
Any idea?
Thxs
10-19-2021 03:36 AM
Hi @Kaniz Fatma, nice to meet you too!
I've been looking for a fix to this problem for many days, and I found some similar questions in different forums, including the Databricks one, but without any real solution.
For that reason I've created this question, hoping to solve it ASAP.
Thxs
10-19-2021 08:16 AM
Can you elaborate a bit on the environment?
Is it a streaming job or a batch job? Where do you write to: S3, ADLS, ...?
Do you mount/unmount storage, etc.?
10-19-2021 09:19 AM
What's happening here is that Spark builds a list of parquet file names it wants to pull data from. Then, when it goes to read one of those parquet files, it notices that the file does not actually exist in storage anymore, so it throws this error.
Usually this is caused by some other process updating/deleting the files in this location while the read is taking place. I would look to see what else could be touching this location at the same time.
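If the file listing is merely stale (for example, the same job rewrote the data earlier on), the cache invalidation that the error message suggests can be done like this; the table name and path below are placeholders:

```python
# Refresh the cached file listing for a registered table
# (equivalent to SQL: REFRESH TABLE my_table).
spark.catalog.refreshTable("my_table")

# Or refresh everything Spark has cached for a storage path.
spark.catalog.refreshByPath("dbfs:/mnt/some-container/some-table")
```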
10-19-2021 09:43 AM
Hi @Jose Eliseo Aznarte Garcia,
Like @Dan Zafar said, this is happening due to file updates/changes during your job execution. Do you delete data manually, or drop and recreate tables in the same place? I would highly recommend using Delta instead. By using Delta, you will avoid this error.
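For illustration, a minimal sketch of moving a parquet write/read over to Delta; the path is a placeholder and the real job layout will differ:

```python
df = spark.range(10)  # placeholder DataFrame standing in for the job's output

# Write with the Delta format instead of plain parquet.
(df.write
   .format("delta")
   .mode("overwrite")
   .save("dbfs:/mnt/some-container/some-table"))

# Read it back through the Delta reader, which resolves files from the transaction log
# instead of a directory listing, so stale/missing part files no longer break the read.
df2 = spark.read.format("delta").load("dbfs:/mnt/some-container/some-table")
```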
10-19-2021 09:45 AM
+1 to Delta!
10-20-2021 02:10 AM
Thxs for your answers
About the environment: we are running batch jobs on Databricks Runtime Version 6.4, with Apache Spark 2.4.5, and our code is written in Python 3.7.6.
Today we realized that all our errors are taking place in the same storage account, but in different files and different jobs, as I told you before.
Is it possible that the error could be caused by an overload of the storage?
I also found a file "_commited_vacuum" in the parquet directory which causes an error. What does it mean?
10-20-2021 06:01 AM
Vacuum means that Delta was removing files. It's important not to read Delta-backed parquet files with the plain parquet reader, as it will cause version problems. Are the tables backed by Delta?
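A quick, hedged way to check, using the Delta Lake Python API that ships with Databricks runtimes; the path is a placeholder:

```python
from delta.tables import DeltaTable

path = "dbfs:/mnt/some-container/some-table"  # placeholder

if DeltaTable.isDeltaTable(spark, path):
    # Resolve files through the Delta transaction log rather than a directory listing.
    df = spark.read.format("delta").load(path)
else:
    df = spark.read.parquet(path)
```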
A side note: it's important to update to 3.2 as soon as possible. AQE, which arrived in the 3.0 release, is going to fix a lot of bugs and speed up your queries too.
10-20-2021 06:31 AM
It is not necessarily a Delta table, as you can also vacuum 'plain' Spark tables:
https://docs.databricks.com/spark/latest/spark-sql/dbio-commit.html#vacuum-spark
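For reference, a hedged sketch of manually vacuuming a non-Delta directory written with the DBIO commit protocol, along the lines of the doc above; the path and retention window are placeholders:

```python
# Clean up uncommitted files older than the retention window in a plain parquet directory.
spark.sql("VACUUM 'dbfs:/mnt/some-container/some-table' RETAIN 168 HOURS")
```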
10-20-2021 02:21 AM
Vacuum is the cleaning up of uncommitted files. This happens automatically in Databricks, but you can also trigger it manually.
My guess is that you have multiple jobs updating/deleting files in a parquet directory.
(As Dan and Jose mentioned.)
Can you check this?
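One hedged way to check what is sitting in a failing directory, including commit markers like the _commited_vacuum file mentioned above; the path is a placeholder:

```python
# List the parquet directory on DBFS and print any DBIO commit markers
# (_started_*, _committed_*) alongside the data files.
path = "dbfs:/mnt/some-container/some-table"  # placeholder

for f in dbutils.fs.ls(path):
    if f.name.startswith("_"):
        print(f.name, f.size)
```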
10-25-2021 09:21 AM
Hi all
We moved one of the processes to use the storage of a different Azure account a few days ago, and the error that I reported has not happened again.
I don't think it was a coincidence, so I conclude that the problem was related to some overload in the storage, because I'm sure our processes don't read and write the same file at the same time.
02-23-2022 02:53 AM
Hi all,
I am also looking for a resolution of the same error. We are using DBR "9.1 LTS ML (includes Apache Spark 3.1.2, Scala 2.12)" and getting this error. We are reading and writing data from the same path, but there are partitions inside the folder to differentiate the paths. Is there any solution to this error?
04-11-2022 05:12 AM
Hi @Kaniz Fatma, here I am sharing the error log.