Data Engineering

Why do I get 'java.io.IOException: File already exists' for saveAsTable with Overwrite mode?

MarcLimotte
New Contributor II

I have a fairly small, simple DataFrame, month:

month.schema

org.apache.spark.sql.types.StructType = StructType(StructField(month,DateType,true), StructField(real_month,TimestampType,true), StructField(month_millis,LongType,true))

The month DataFrame is derived from a DataFrame originally created from an RDD built with sc.parallelize(...).
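
For illustration, a simplified sketch of that setup (the rows below are placeholders, not the real data):

import java.sql.{Date, Timestamp}
import sqlContext.implicits._  // sqlContext is predefined in Databricks notebooks

// Placeholder rows with the same shape: (DateType, TimestampType, LongType)
val month = sc.parallelize(Seq(
  (Date.valueOf("2015-01-01"), new Timestamp(1420070400000L), 1420070400000L),
  (Date.valueOf("2015-02-01"), new Timestamp(1422748800000L), 1422748800000L)
)).toDF("month", "real_month", "month_millis")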

I try to save it as a table:

month.write.mode(SaveMode.Overwrite).saveAsTable("month_x2")

And I get an exception. The root cause seems to be (also see full stacktrace below):

Caused by: java.io.IOException: File already exists: /databricks-prod-storage-virginia/dbc-44061e6b-9dd3/0/user/hive/warehouse/month_x2/part-r-00002-9858e235-1c6c-4276-800d-18c8a760a416.gz.parquet

I've restarted the cluster and rerun the notebook, and I get the same result every time. I'm using Overwrite mode (although I think non-overwrite produces a different error anyway). I also get this error even when I change the table name (i.e., even on the first saveAsTable call for a given name).

Error is in attached file

12 REPLIES

MarcLimotte
New Contributor II

BTW, I'm on a Spark 1.4 Databricks cluster.

vida
Contributor II

Hi,

If you run:

dbutils.fs.rm("dbfs:/user/hive/warehouse/month_x2/", true)

before you do the

saveAsTable
, your command should execute as you'd like.
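
For example, putting both steps in one cell (assuming the month DataFrame and table name from the original question):

import org.apache.spark.sql.SaveMode

dbutils.fs.rm("dbfs:/user/hive/warehouse/month_x2/", true)  // clear any leftover files first
month.write.mode(SaveMode.Overwrite).saveAsTable("month_x2")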

-V

MarcLimotte
New Contributor II

It seems to work as I expect now, even without doing the explicit dbutils.fs.rm(...). Must have been some intermittent problem.

MattHobby
New Contributor II

I keep experiencing this same problem. It doesn't occur all the time, and I assume it is related to an S3 sync issue? Do we know more details or have a fix?

vida
Contributor II

Hi,

Is it possible that you tried to create that table before and it failed? Or even that there was a failure this time while creating the table? Our open source team sees this problem sometimes, and the error message is misleading. Basically, a run that tries to create the table may fail partway through. The file created by the failed task gets uploaded to S3, and any retries will then see that file and report that it already exists. I suggest two best practices for preventing this:

1) Make sure you get rid of possible corrupt files.

a) Always blindly delete the table directory when you want to overwrite it in case there are leftover corrupt files.

b) Wrap your table creation in a try-catch block. If it fails, catch the exception and clean up the folder (see the sketch after this list).

2) When you do get this table-write error, go to the Spark cluster UI and drill down to the task that failed to understand the real error. Relying only on the error message in the notebook is not enough.
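
A rough sketch of the cleanup in (1b), in Scala, assuming the month DataFrame from the original question; the table name and warehouse path below are illustrative:

import org.apache.spark.sql.SaveMode

val tableName = "month_x2"                                // illustrative
val tablePath = s"dbfs:/user/hive/warehouse/$tableName/"  // illustrative; adjust to your warehouse location

try {
  month.write.mode(SaveMode.Overwrite).saveAsTable(tableName)
} catch {
  case e: Exception =>
    // A failed attempt can leave partial files behind; clean them up before retrying
    dbutils.fs.rm(tablePath, true)
    throw e
}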

JungKim
New Contributor II

Hi, I'm getting this error as well.

I have tried deleting the directory and confirmed the deletion, but it did not solve my issue. It seems the last line of the stack trace is in Databricks' S3AFileSystem implementation: "com.databricks.s3a.S3AFileSystem.create(S3AFileSystem.java:452)"

This error is not intermittent for me; it is consistent for one particular DataFrame.

FYI, all other DataFrames except this one particular frame get written to Parquet correctly. They all have 25 partitions coming from the same data source; it's just a different segment of the table.

Write code:

df.write
  .mode(SaveMode.Overwrite)
  .parquet(s3Prefix + s"${name}.parquet")

Full stack trace:

Full Stack Trace File

ReKa
New Contributor III

I see the same problem frequently, despite a brute-force rm and changing the table name.

ThuyINACTIVETra
New Contributor II

Hi,

I get this exception when using df.write.parquet(), in both overwrite and default mode, for a completely new location.

The exception is intermittent and causes our data pipeline to crash randomly.

Spark version: 1.6.0

Does anyone have more information about this?

ReKa
New Contributor III

In a similar case, the following fixed the problem:

- Using Memory Optimised nodes (Compute Optimised had problems)

- A tighter schema definition (especially for nested structures in PySpark, where field order may matter)

- Using S3A mounts instead of S3N mounts

- Using Hadoop 2 and the latest Databricks Spark 1.6.1

- The problem could also be partially avoided by saving as JSON and converting to Parquet at the end (but watch for zero-sized files, which can indicate corrupt partitions); see the sketch below
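
A rough Scala sketch of that JSON-first workaround (df, the paths, and the staging location are placeholders):

import org.apache.spark.sql.SaveMode

val jsonPath = "dbfs:/tmp/my_table_json"        // placeholder staging location
val parquetPath = "dbfs:/tmp/my_table_parquet"  // placeholder final location

df.write.mode(SaveMode.Overwrite).json(jsonPath)          // stage as JSON first
sqlContext.read.json(jsonPath)
  .write.mode(SaveMode.Overwrite).parquet(parquetPath)    // convert to Parquet at the end

// Watch for zero-sized output files, which can indicate corrupt partitions
dbutils.fs.ls(jsonPath).filter(_.size == 0).foreach(f => println(f.path))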

ThuyINACTIVETra
New Contributor II

@ReKa, thanks heaps.

I'm using S3A already, and the schema is very clearly defined and looks like this:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType, ShortType, BooleanType

OUTPUT_SCHEMA = StructType([
    StructField("c1", StringType(), True),
    StructField("c2", ArrayType(StringType()), True),
    StructField("c3", ShortType(), True),
    StructField("c4", BooleanType(), True)
])

I think this schema is tight enough.

On the notes:

+ "Compute Optimised had problems": do you what types of problems it has? or only writing data?

+ json and converting to parquet at the end (But watch for zero-sized files which can show corrupt partitions): do you have more information about this?

Many thanks.

ReKa
New Contributor III

Your schema is tight, but make sure that the conversion to it does not throw an exception.

Try with Memory Optimised nodes; you may be fine.

My problem was parsing a lot of data from sequence files containing 10K XML files and saving them as a table. In my case, the main bottleneck was moving data inside AWS (from S3 to the Spark nodes).

df.write.mode('overwrite').format('parquet').saveAsTable(new_name)  # change 'parquet' to 'json' to save as JSON instead

When your job is finished, look at the Hive warehouse directory for the table above and see how many files are zero-sized.

@ReKa, thank you. I had an exception in converting a data type. Hopefully the issue won't happen again after the fix.
