
Error writing parquet files

JEAG
New Contributor III

Hi, we are having this chain of errors every day in different files and processes:

An error occurred while calling o11255.parquet.

: org.apache.spark.SparkException: Job aborted.

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 982.0 failed 4 times, most recent failure: Lost task 0.3 in stage 982.0 (TID 85705, 172.20.45.5, executor 31): org.apache.spark.SparkException: Task failed while writing rows.

Caused by: com.databricks.sql.io.FileReadException: Error while reading file dbfs: ... It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

Caused by: shaded.parquet.org.apache.thrift.transport.TTransportException: java.io.IOException: Stream is closed!

Caused by: java.io.IOException: Stream is closed!

Caused by: java.io.FileNotFoundException: dbfs:/...

For now we fix it by deleting the file and running the job again, but we don't know how to avoid the error.

Any idea?

Thxs

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @Jose Eliseo Aznarte Garcia,

This is expected behaviour when you update some rows in the table and immediately query the table.

From the error message: 

It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running the 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

To resolve this issue, refresh all cached entries that are associated with the table.

REFRESH TABLE [db_name.]table_name

This command refreshes all cached entries associated with the table.

If the table was previously cached, then it would be cached lazily the next time it is scanned.
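
For reference, a minimal sketch of what that refresh could look like from a Python notebook; the table name and path below are placeholders, not names from the original job:

# Refresh all cached entries for a table (placeholder name).
spark.sql("REFRESH TABLE my_db.my_table")

# Equivalent call through the catalog API.
spark.catalog.refreshTable("my_db.my_table")

# If the data is read by path rather than by table name, the cached
# file listing can be invalidated by path instead.
spark.catalog.refreshByPath("dbfs:/mnt/some/parquet/path")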


15 REPLIES

Kaniz
Community Manager

Hi @JEAG! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Or else I will get back to you soon. Thanks.

JEAG
New Contributor III

Hi @Kaniz Fatma, nice to meet you too!

I've been looking for a fix to this problem for many days, and I found some similar questions in different forums, including the Databricks one, but without any real solution.

For that reason I've created this question, hoping to solve this ASAP.

Thxs

-werners-
Esteemed Contributor III

Can you elaborate a bit on the environment?

Is it a streaming job or batch? Where do you write to, S3, ADLS, ...?

Do you mount/unmount, etc.?

Dan_Z
Honored Contributor

What's happening here is that Spark has built a list of parquet file names that it wants to pull data from. Then, when it goes to read one of those parquet files, it finds that the file no longer exists in storage, so it throws this error.

Usually this is caused by some other process updating/deleting the files in this location while the read is taking place. I would look to see what else could be touching this location at the same time.
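
One common way this race shows up is a job that reads a parquet directory and then overwrites that same directory in the same run: once the overwrite starts deleting old files, any task still reading them fails with a FileNotFoundException. A minimal sketch of one way to keep the read and the write apart, assuming a hypothetical staging path (none of these names come from this thread):

# Hypothetical paths for illustration only.
source_path = "dbfs:/mnt/data/events"
staging_path = "dbfs:/mnt/data/events_staging"

df = spark.read.parquet(source_path)
transformed = df.dropDuplicates()

# Write to a separate staging location first, so the files being
# read are never deleted while the job is still running...
transformed.write.mode("overwrite").parquet(staging_path)

# ...then swap the directories only after the write has fully committed.
dbutils.fs.rm(source_path, True)
dbutils.fs.mv(staging_path, source_path, True)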

jose_gonzalez
Moderator

Hi @Jose Eliseo Aznarte Garcia,

Like @Dan Zafar said, this is happening due to file updates/changes during your job execution. Do you delete data manually, or drop and recreate tables in the same place? I would highly recommend using Delta instead. By using Delta, you will avoid this error.
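
As a rough illustration of that suggestion, here is a minimal sketch of switching the write to Delta; the DataFrame and path are placeholders for whatever the job actually produces:

# Placeholder data and path for illustration only.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
target_path = "dbfs:/mnt/data/target_table"

# Write with the Delta format instead of plain parquet.
df.write.format("delta").mode("overwrite").save(target_path)

# Read it back through the Delta reader rather than spark.read.parquet.
delta_df = spark.read.format("delta").load(target_path)

# An existing plain-parquet directory can also be converted in place with SQL:
# CONVERT TO DELTA parquet.`dbfs:/mnt/data/some_parquet_dir`

Because Delta commits each write through a transaction log, concurrent readers keep seeing a consistent snapshot instead of a half-deleted set of parquet files.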

Dan_Z
Honored Contributor

+1 to Delta!

JEAG
New Contributor III

Thxs for your answers

About the environment: we are running batch jobs on Databricks Runtime Version 6.4, with Apache Spark 2.4.5, and our code is written in Python 3.7.6.

Today we realized that all our errors are taking place in the same storage account, but in different files and different jobs, as I told you before.

Is it possible that the error could be caused by an overload of the storage?

I found a file "_commited_vacuum" in the parquet directory which causes an error. What does it mean?

Anonymous
Not applicable

Vacuum means that Delta was removing files. It's important not to try to read Delta parquet files with the parquet reader, as it will cause version problems. Are the tables backed by Delta?

A side note: it's important to update to 3.2 as soon as possible. AQE from the 3.0 release will fix a lot of bugs and speed up the queries too.
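
To make the reader distinction concrete, a short sketch with a placeholder path: the plain parquet reader lists every .parquet file in the directory, including stale files from older versions that VACUUM may remove, while the Delta reader only resolves the files referenced by the current transaction log.

delta_path = "dbfs:/mnt/data/delta_table"   # placeholder path

# Risky on a Delta directory: picks up every parquet file on disk,
# including ones that no longer belong to the current table version.
raw_df = spark.read.parquet(delta_path)

# Safe: resolves the current snapshot from the _delta_log directory.
snapshot_df = spark.read.format("delta").load(delta_path)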

-werners-
Esteemed Contributor III

Vacuum is the cleaning up of uncommitted files. This happens automatically in Databricks, but you can also trigger it manually.

My guess is that you have multiple jobs updating/deleting files in a parquet directory.

(As Dan and Jose mentioned.)

Can you check this?

JEAG
New Contributor III

Hi all

We moved one of the processes to use the storage of a different Azure account a few days ago, and the error that I reported has not happened again.

I don't think it was a coincidence, so I conclude that the problem was related to some overload in the storage, because I'm sure that our processes don't read and write the same file at the same time.

databircks
New Contributor II

Hi all,

I am also looking for a resolution of the same error. We are using DBR "9.1 LTS ML (includes Apache Spark 3.1.2, Scala 2.12)" and getting this error. We are reading and writing data from the same path, but there are partitions inside the folder to differentiate the paths. Is there any solution to this error?
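
If, as the earlier replies suggest, the trigger is a write clearing out files under a path that is still being read, one pattern that may narrow the window when writing partitioned data back to the same root is dynamic partition overwrite, so only the partitions present in the new DataFrame are replaced instead of the whole directory being wiped first. A minimal sketch, with made-up column and path names:

# Hypothetical partitioned dataset for illustration only.
df = spark.createDataFrame(
    [(1, "2022-01-01"), (2, "2022-01-02")],
    ["id", "ingest_date"],
)
base_path = "dbfs:/mnt/data/events"

# Replace only the partitions present in this DataFrame and leave
# every other partition directory untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df.write
    .mode("overwrite")
    .partitionBy("ingest_date")
    .parquet(base_path))

Even with dynamic overwrite, a partition that is being rewritten should not be read by another job at the same moment, so this narrows the race rather than removing it; moving the table to Delta, as suggested above, is the more robust fix.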

Kaniz
Community Manager

Hi @Bhavsik Ahir, can you paste the error stack here?

databircks
New Contributor II

Hi @Kaniz Fatma, here I am sharing the error log.
