Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Saving to parquet with SaveMode.Overwrite throws exception

KrisMusial
New Contributor

Hello, I'm trying to save a DataFrame as parquet with SaveMode.Overwrite, with no success.

I minimized the code and reproduced the issue with the following two cells:

> case class MyClass(val fld1: Integer, val fld2: Integer)
> 
> val lst1 = sc.parallelize(List(MyClass(1, 2), MyClass(1, 3))).toDF
> lst1.show
> lst1.write.mode(SaveMode.Overwrite).parquet("/mnt/lf/write-test/lst1.parquet")

> case class MyClass(val fld1: Integer, val fld2: Integer)
> 
> val lst1 = sqlContext.read.parquet("/mnt/lf/write-test/lst1.parquet")
> val lst2 = sc.parallelize(List(MyClass(1, 4), MyClass(2, 3))).toDF
> lst1.registerTempTable("tbl1")
> lst2.registerTempTable("tbl2")
> 
> val sql = """
>   SELECT t1.*
>     FROM tbl1 t1
>     LEFT JOIN tbl2 t2 ON t2.fld1 = t1.fld1
>     WHERE t2.fld1 IS NULL
>   UNION
>   SELECT t2.*
>    FROM tbl2 t2
> """
> val lst3 = sqlContext.sql(sql)
> lst3.show
> lst3.write.mode(SaveMode.Overwrite).parquet("/mnt/lf/write-test/lst1.parquet")

The idea is to update the saved DataFrame by replacing it with new content. The new content is derived from the previously saved copy combined with a new DataFrame. After executing the first cell, and then the second cell with the last line commented out, lst3.show displays the correct updated content.

However, an attempt to save lst1.parquet again throws an exception:

org.apache.spark.SparkException: Job aborted.

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 48.0 failed 1 times, most recent failure: Lost task 0.0 in stage 48.0 (TID 1779, localhost): java.io.FileNotFoundException: /mnt/lf/save-test/lst1.parquet/part-r-00000-a119b6a9-64a6-4ba7-ba87-ad24341f7eea.gz.parquet
at com.databricks.backend.daemon.data.client.DbfsClient.send0(DbfsClient.scala:65)
at com.databricks.backend.daemon.data.client.DbfsClient.sendIdempotent(DbfsClient.scala:42)

...

I appreciate any help.

Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions

miklos
Contributor

The reason this causes a problem is that you're reading from and writing to the same path you're trying to overwrite. SaveMode.Overwrite deletes the target directory before writing, and because the read of lst1 is lazy, the job then tries to read source files that no longer exist. The data cannot be streamed into the same directory it is being read from.

I'd recommend writing the new data to a temporary location first, then replacing the original from there.
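A minimal sketch of that workaround, continuing the notebook code from the question. The temporary path lst1_tmp.parquet and the dbutils.fs.rm cleanup call are assumptions for illustration, not part of the original post:

```scala
// Sketch: materialize the result somewhere other than the path it was
// derived from, then overwrite the original from the temporary copy.
val target = "/mnt/lf/write-test/lst1.parquet"
val tmp    = "/mnt/lf/write-test/lst1_tmp.parquet"   // assumed temp path

// 1. Write lst3 to the temporary location; nothing is reading from it.
lst3.write.mode(SaveMode.Overwrite).parquet(tmp)

// 2. The original directory is no longer a read source, so it can now
//    be safely overwritten from the materialized copy.
sqlContext.read.parquet(tmp)
  .write.mode(SaveMode.Overwrite).parquet(target)

// 3. Remove the temporary copy (Databricks notebook utility).
dbutils.fs.rm(tmp, recurse = true)
```

The key point is step 1: once lst3 has been fully written out, the overwrite in step 2 reads from the temporary directory rather than from the directory being deleted.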


2 REPLIES


Guru421421
New Contributor II

> results.select("ValidationTable", "Results", "Description", "CreatedBy", "ModifiedBy", "CreatedDate", "ModifiedDate").write.mode('overwrite').save("
