Saving to parquet with SaveMode.Overwrite throws exception

KrisMusial
New Contributor

Hello, I'm trying to save DataFrame in parquet with SaveMode.Overwrite with no success.

I minimized the code and reproduced the issue with the following two cells:

> case class MyClass(val fld1: Integer, val fld2: Integer)
> 
> val lst1 = sc.parallelize(List(MyClass(1, 2), MyClass(1, 3))).toDF
> lst1.show
> lst1.write.mode(SaveMode.Overwrite).parquet("/mnt/lf/write-test/lst1.parquet")
> case class MyClass(val fld1: Integer, val fld2: Integer)
> 
> val lst1 = sqlContext.read.parquet("/mnt/lf/write-test/lst1.parquet")
> val lst2 = sc.parallelize(List(MyClass(1, 4), MyClass(2, 3))).toDF
> lst1.registerTempTable("tbl1")
> lst2.registerTempTable("tbl2")
> 
> val sql = """
>   SELECT t1.*
>     FROM tbl1 t1
>     LEFT JOIN tbl2 t2 ON t2.fld1 = t1.fld1
>     WHERE t2.fld1 IS NULL
>   UNION
>   SELECT t2.*
>    FROM tbl2 t2
> """
> val lst3 = sqlContext.sql(sql)
> lst3.show
> lst3.write.mode(SaveMode.Overwrite).parquet("/mnt/lf/write-test/lst1.parquet")

The idea is to update the saved DataFrame by replacing it with new content derived from the previously saved copy and a new DataFrame. After executing the first cell, and then the second cell with its last line commented out, lst3.show displays the correct updated content.

However, an attempt to save lst1.parquet again throws an exception:

org.apache.spark.SparkException: Job aborted.

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 48.0 failed 1 times, most recent failure: Lost task 0.0 in stage 48.0 (TID 1779, localhost): java.io.FileNotFoundException: /mnt/lf/save-test/lst1.parquet/part-r-00000-a119b6a9-64a6-4ba7-ba87-ad24341f7eea.gz.parquet
	at com.databricks.backend.daemon.data.client.DbfsClient.send0(DbfsClient.scala:65)
	at com.databricks.backend.daemon.data.client.DbfsClient.sendIdempotent(DbfsClient.scala:42)

...

I appreciate any help.

Thanks.

1 ACCEPTED SOLUTION

miklos
Contributor

The reason this causes a problem is that you're reading from and writing to the same path you're trying to overwrite. Because the overwrite deletes the existing files before the query that reads them has finished, the data cannot be streamed into the same directory it is being read from.

I'd recommend writing this data to a temporary location first, then replacing the original.
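A minimal sketch of that workaround, assuming a hypothetical scratch path alongside the original and Databricks' dbutils.fs for the final move (on plain Spark you'd use the Hadoop FileSystem API instead):

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical scratch location next to the target path.
val tmpPath   = "/mnt/lf/write-test/lst1_tmp.parquet"
val finalPath = "/mnt/lf/write-test/lst1.parquet"

// Write the derived result to the scratch location first, so the
// source parquet files stay readable while the job is running.
lst3.write.mode(SaveMode.Overwrite).parquet(tmpPath)

// Only after the write has succeeded, swap the directories.
dbutils.fs.rm(finalPath, recurse = true)
dbutils.fs.mv(tmpPath, finalPath, recurse = true)
```

The extra write costs one pass over the data, but it guarantees the original files are not deleted while a query still depends on them.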


2 REPLIES

miklos
Contributor

(Accepted solution, shown above.)

Guru421421
New Contributor II

results.select("ValidationTable", "Results","Description","CreatedBy","ModifiedBy","CreatedDate","ModifiedDate").write.mode('overwrite').save("
