Why do I get 'java.io.IOException: File already exists' for saveAsTable with Overwrite mode?

MarcLimotte
New Contributor II

I have a fairly small, simple DataFrame, month:

month.schema

org.apache.spark.sql.types.StructType = StructType(StructField(month,DateType,true), StructField(real_month,TimestampType,true), StructField(month_millis,LongType,true))

The month DataFrame is derived from a DataFrame originally created from an RDD, which comes from a sc.parallelize(...).

I try to save it as a table:

month.write.mode(SaveMode.Overwrite).saveAsTable("month_x2")

And I get an exception. The root cause seems to be (also see full stacktrace below):

Caused by: java.io.IOException: File already exists: /databricks-prod-storage-virginia/dbc-44061e6b-9dd3/0/user/hive/warehouse/month_x2/part-r-00002-9858e235-1c6c-4276-800d-18c8a760a416.gz.parquet

I've restarted the cluster, and reran the notebook and get the same result every time. I'm using Overwrite mode (although, I think non-overwrite produces a different error anyway). Also, I get this error even when I change the tableName (i.e. even on the first saveAsTable call for a given name).

The error is in the attached file.

12 REPLIES

MarcLimotte
New Contributor II

BTW, I'm on a Spark 1.4 Databricks cluster.

vida
Contributor II

Hi,

If you run:

dbutils.fs.rm("dbfs:/user/hive/warehouse/month_x2/", true)

before you do the saveAsTable, your command should execute as you'd like.
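Putting the two steps together (a minimal sketch, reusing the month DataFrame and the month_x2 table name from the original post; the warehouse path is the default location and may differ on your cluster):

import org.apache.spark.sql.SaveMode

// Remove any leftover files for the table, then write it fresh.
dbutils.fs.rm("dbfs:/user/hive/warehouse/month_x2/", true)
month.write.mode(SaveMode.Overwrite).saveAsTable("month_x2")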

-V

MarcLimotte
New Contributor II

It seems to work as I expect now, even without doing the explicit dbutils.fs.rm(...). Must have been some intermittent problem.

MattHobby
New Contributor II

I keep experiencing this same problem - it doesn't occur all the time, and I assume it is related to an S3 sync problem? Do we know more details or a fix?

vida
Contributor II

Hi,

Is it possible that you tried to create that table before and that attempt failed? Or even that there was a failure this time in creating the table? Our open source team sees this problem sometimes, and the error message is misleading. Basically, a run that tries to create the table may fail, the file created by the failed task still gets uploaded to S3, and then any retry sees that file and reports that the file already exists. I suggest two best practices for preventing this:

1) Make sure you get rid of possible corrupt files.

a) Always blindly delete the table directory when you want to overwrite it in case there are leftover corrupt files.

b) Wrap your table creation in a try-catch block. If it fails, catch the exception and clean up the folder (see the sketch after this list).

2) When you do get this table write error - you should go to the Spark cluster UI and drill down to the task that failed to understand the real error. Just relying on the error message in the notebook is not enough.
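A minimal Scala sketch of practices 1a and 1b, assuming the month DataFrame and month_x2 table from the original post and the default Hive warehouse path (both are assumptions; adjust to your cluster):

import org.apache.spark.sql.SaveMode

val tableName = "month_x2"
val tablePath = s"dbfs:/user/hive/warehouse/$tableName"

// 1a) Blindly delete the table directory in case there are leftover corrupt files.
dbutils.fs.rm(tablePath, true)

try {
  month.write.mode(SaveMode.Overwrite).saveAsTable(tableName)
} catch {
  case e: Exception =>
    // 1b) Clean up the partial output so a retry does not hit "File already exists".
    dbutils.fs.rm(tablePath, true)
    throw e
}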

JungKim
New Contributor II

Hi I'm getting this error as well.

I have tried deleting the directory and confirmed the deletion, but it did not solve my issue. The last line of the stack trace is in Databricks' S3AFileSystem implementation: "com.databricks.s3a.S3AFileSystem.create(S3AFileSystem.java:452)"

This error is not intermittent for me; it happens consistently for one particular DataFrame.

FYI, all other DataFrames except this one particular frame get written to Parquet correctly. They all have 25 partitions coming from the same data source; it's just a different segment of the table.

Write code:

df.write
  .mode(SaveMode.Overwrite)
  .parquet(s3Prefix + s"${name}.parquet")

Full stack trace:

Full Stack Trace File

ReKa
New Contributor III

I see the same problem frequently, despite a brute-force rm and changing the table name.

ThuyINACTIVETra
New Contributor II

Hi,

I get this exception when using df.write.parquet(), in both overwrite and default mode, for a completely new location.

The exception is intermittent and causes our data pipeline to crash randomly.

Spark version: 1.6.0

Does anyone have more information about this?

ReKa
New Contributor III

In a similar case, the following fixed the problem:

- Using memory-optimized nodes (compute-optimized nodes had problems)

- A tighter definition of the schema (especially for nested structures in PySpark, where order may matter)

- Using S3a mounts instead of S3n mounts

- Using Hadoop 2 and the latest Databricks Spark 1.6.1

- The problem could also be partially avoided by saving as JSON and converting to Parquet at the end, as sketched below (but watch for zero-sized files, which can indicate corrupt partitions)
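A rough Scala sketch of the JSON-then-Parquet workaround; the DataFrame df and the staging/final paths below are hypothetical placeholders:

// Stage the data as JSON first, then convert to Parquet in a separate pass.
val jsonPath = "dbfs:/tmp/staging/my_table_json"      // hypothetical staging location
val parquetPath = "dbfs:/tmp/final/my_table_parquet"  // hypothetical final location

df.write.mode("overwrite").json(jsonPath)

val staged = sqlContext.read.json(jsonPath)
staged.write.mode("overwrite").parquet(parquetPath)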

ThuyINACTIVETra
New Contributor II

@ReKa thanks heaps.

I'm using S3a already, and the schema is very clearly defined and looks like this:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType, ShortType, BooleanType

OUTPUT_SCHEMA = StructType([
    StructField("c1", StringType(), True),
    StructField("c2", ArrayType(StringType()), True),
    StructField("c3", ShortType(), True),
    StructField("c4", BooleanType(), True)
])

I think this schema is tight enough.

On the notes:

+ "Compute Optimised had problems": do you what types of problems it has? or only writing data?

+ Saving as JSON and converting to Parquet at the end (but watch for zero-sized files which can show corrupt partitions): do you have more information about this?

Many thanks.

ReKa
New Contributor III

Your schema is tight, but make sure that the conversion to it does not throw an exception.

Try with memory-optimized nodes; you may be fine.

My problem was parsing a lot of data from sequence files containing 10K XML files and saving them as a table. In my case, the main bottleneck was moving data inside AWS (from S3 to the Spark nodes).

df.write.mode('overwrite').format('parquet').saveAsTable(new_name)  # change 'parquet' to 'json' to save as JSON instead

When your job is finished, look at the Hive directory for the above table and see how many files are zero-sized.
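A small Scala sketch of that check, assuming the table lives under the default Hive warehouse path (the table name below is hypothetical; use your table's actual location):

// Count zero-sized part files under the table's warehouse directory.
val tableDir = "dbfs:/user/hive/warehouse/new_name/"  // hypothetical path
val files = dbutils.fs.ls(tableDir)
val zeroSized = files.filter(f => f.size == 0 && f.name.startsWith("part-"))
println(s"${zeroSized.length} zero-sized part files out of ${files.length} total")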

ThuyINACTIVETra
New Contributor II

@ReKa, thank you. I have an exception in converting a data type. Hopefully the issue won't happen again after the fix.
