Why do I get 'java.io.IOException: File already exists' for saveAsTable with Overwrite mode?
07-21-2015 07:10 AM
I have a fairly small, simple DataFrame, month:
month.schema
org.apache.spark.sql.types.StructType = StructType(StructField(month,DateType,true), StructField(real_month,TimestampType,true), StructField(month_millis,LongType,true))
The month DataFrame is derived from a DataFrame originally created from an RDD that comes from sc.parallelize(...).
I try to save it as a table:
month.write.mode(SaveMode.Overwrite).saveAsTable("month_x2")
And I get an exception. The root cause seems to be (also see full stacktrace below):
Caused by: java.io.IOException: File already exists: /databricks-prod-storage-virginia/dbc-44061e6b-9dd3/0/user/hive/warehouse/month_x2/part-r-00002-9858e235-1c6c-4276-800d-18c8a760a416.gz.parquet
I've restarted the cluster and rerun the notebook, and I get the same result every time. I'm using Overwrite mode (although I think non-overwrite mode produces a different error anyway). I also get this error when I change the table name (i.e., even on the first saveAsTable call for a given name).
The error is in the attached file.
07-21-2015 07:19 AM
BTW, I'm on a Spark 1.4 Databricks cluster.
07-24-2015 10:26 AM
Hi,
If you run:
dbutils.fs.rm("dbfs:/user/hive/warehouse/month_x2/", true)
before you do the saveAsTable, your command should execute as you'd like.
-V
07-24-2015 10:34 AM
It seems to work as I expect now, even without doing the explicit dbutils.fs.rm(...). Must have been some intermittent problem.
10-29-2015 04:18 AM
I keep experiencing this same problem. It doesn't occur all the time, and I assume it is caused by an S3 sync problem. Do we know more details, or is there a fix?
10-29-2015 01:19 PM
Hi,
Is it possible that you tried to create that table before and that attempt failed? Or even that there was a failure this time in creating the table? Our open source team sees this problem sometimes, and the error message is misleading. Basically, an attempt to create the table may have failed partway through. The file created by the failed task gets uploaded to S3, and then any retries will see that file and report that the file already exists. I suggest two best practices for preventing this:
1) Make sure you get rid of possible corrupt files.
a) Always blindly delete the table directory when you want to overwrite it in case there are leftover corrupt files.
b) Wrap your table creation in a try-catch block. If it fails, catch the exception and clean up the folder.
2) When you do get this table write error, go to the Spark cluster UI and drill down to the task that failed to understand the real error. Relying on the error message in the notebook alone is not enough.
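Practices 1a and 1b above can be combined into one small helper. The following is only a sketch: overwrite_table and its parameters are hypothetical names, and the write and delete steps are passed in as callables so the pattern is visible without a running cluster. It assumes that deleting the table directory is acceptable, which is the case when you intend to overwrite the table anyway.

```python
def overwrite_table(write, delete_dir, table_dir):
    """Delete table_dir, run write(), and delete table_dir again if write() fails.

    write      -- zero-argument callable that performs the table write
    delete_dir -- one-argument callable that recursively deletes a directory
    table_dir  -- directory that backs the table
    """
    # 1a) Remove leftovers from earlier failed attempts before writing.
    delete_dir(table_dir)
    try:
        write()
    except Exception:
        # 1b) A failed write may leave partial files behind; remove them so
        # the next attempt does not trip over them, then re-raise the error.
        delete_dir(table_dir)
        raise


# Hypothetical notebook usage (df and dbutils come from the notebook session):
# overwrite_table(
#     write=lambda: df.write.mode("overwrite").saveAsTable("month_x2"),
#     delete_dir=lambda path: dbutils.fs.rm(path, True),
#     table_dir="dbfs:/user/hive/warehouse/month_x2/",
# )
```

The helper re-raises the original exception, so the failure still surfaces in the notebook and you can follow point 2 and look up the failed task in the Spark UI.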
03-02-2016 02:03 PM
Hi, I'm getting this error as well.
I have tried deleting the directory and confirmed the deletion, but it did not solve my issue. The last line of code in the trace appears to be in Databricks' S3AFileSystem implementation: "com.databricks.s3a.S3AFileSystem.create(S3AFileSystem.java:452)"
This error is not intermittent for me; it occurs consistently for one DataFrame.
FYI, all other DataFrames except this particular one get written to Parquet correctly. They all have 25 partitions and come from the same data source; each is just a different segment of a table.
Write code:
df.write.mode(SaveMode.Overwrite).parquet(s3Prefix + s"${name}.parquet")
Full stack trace:
03-25-2016 03:26 PM
I see the same problem frequently, despite a brute-force rm and changing table_name.
05-01-2016 09:56 PM
Hi,
I get this exception when using df.write.parquet(), in both overwrite and default mode, for a completely new location.
The exception is intermittent and causes our data pipeline to crash randomly.
Spark version: 1.6.0
Does anyone have more information about this?
05-01-2016 11:09 PM
For a similar problem, the following fixed it:
- Using Memory Optimised nodes (Compute Optimised had problems)
- Tighter definition of the schema (especially for nested clusters in pyspark, where order may matter)
- Using S3a mounts instead of S3n mounts
- Using Hadoop 2 and the latest DB Spark 1.6.1
- The problem could also be partially avoided by saving as JSON and converting to Parquet at the end (but watch for zero-sized files, which can indicate corrupt partitions)
05-01-2016 11:21 PM
@ReKa thanks heaps.
I'm using S3a already, and the schema is very clearly defined and looks like this:
OUTPUT_SCHEMA = StructType([
    StructField("c1", StringType(), True),
    StructField("c2", ArrayType(StringType()), True),
    StructField("c3", ShortType(), True),
    StructField("c4", BooleanType(), True)
])
I think this schema is tight enough.
On the notes:
+ "Compute Optimised had problems": do you know what types of problems it has? Or does it only have problems when writing data?
+ "json and converting to parquet at the end (But watch for zero-sized files which can show corrupt partitions)": do you have more information about this?
Many thanks.
05-02-2016 06:25 AM
Your schema is tight, but make sure that the conversion to it does not throw an exception.
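One way to check the conversion before Spark does is to validate the plain Python rows against the schema first. The sketch below is an assumption-laden illustration, not part of any Spark API: row_problems is a hypothetical helper, it hard-codes the four columns of the OUTPUT_SCHEMA posted above, it assumes rows are tuples, and it assumes Python 3 (where text is str).

```python
SHORT_MIN, SHORT_MAX = -32768, 32767  # value range of Spark's ShortType


def row_problems(row):
    """Return a list of reasons why row does not fit (c1, c2, c3, c4).

    An empty list means the row matches the schema. None is accepted
    everywhere because all four fields are declared nullable.
    """
    if len(row) != 4:
        return ["expected 4 fields, got %d" % len(row)]
    c1, c2, c3, c4 = row
    problems = []
    if c1 is not None and not isinstance(c1, str):
        problems.append("c1 is not a string")
    if c2 is not None and not (
        isinstance(c2, list)
        and all(x is None or isinstance(x, str) for x in c2)
    ):
        problems.append("c2 is not a list of strings")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if c3 is not None and not (
        isinstance(c3, int)
        and not isinstance(c3, bool)
        and SHORT_MIN <= c3 <= SHORT_MAX
    ):
        problems.append("c3 is not an integer in the ShortType range")
    if c4 is not None and not isinstance(c4, bool):
        problems.append("c4 is not a boolean")
    return problems
```

Running this over a sample of the parsed rows before building the DataFrame can surface a bad record early, instead of as a failed write task.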
Try with Memory Optimized Nodes, you may be fine.
My problem was parsing a lot of data from sequence files containing 10K XML files and saving them as a table. In my case, the main bottleneck was moving data inside AWS (from S3 to the Spark nodes).
df.write.mode('overwrite').format('parquet').saveAsTable(new_name)  # change 'parquet' to 'json'
Once your job is finished, look at the Hive directory for the above table and see how many files are 0-sized.
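The zero-size check can be scripted rather than done by eye. This is a minimal sketch: zero_sized_files is a hypothetical helper that takes (name, size) pairs, so it does not depend on how the listing is obtained. It skips names starting with "_" or "." (for example the _SUCCESS marker, which is normally empty) and names ending in "/", on the assumption that those are subdirectories.

```python
def zero_sized_files(entries):
    """Return the names of data files whose size is 0.

    entries -- iterable of (name, size) pairs for one table directory
    """
    return [
        name
        for name, size in entries
        if size == 0
        # Marker and hidden files are not data files.
        and not name.startswith(("_", "."))
        # Assumed convention: directory names end with a slash.
        and not name.endswith("/")
    ]


# Hypothetical notebook usage, assuming the listing entries expose
# .name and .size (the table directory is a placeholder):
# listing = dbutils.fs.ls("dbfs:/user/hive/warehouse/<table>/")
# print(zero_sized_files((f.name, f.size) for f in listing))
```

A non-empty result points at partitions worth inspecting in the Spark UI for a failed or retried task.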
05-03-2016 01:08 PM
@Reka, thank you. I did have an exception when converting data types. Hopefully the issue won't happen again after the fix.