Hello, I'm trying to save a DataFrame as Parquet with SaveMode.Overwrite, with no success.
I minimized the code and reproduced the issue with the following two cells. First cell:
> import org.apache.spark.sql.SaveMode
>
> case class MyClass(val fld1: Integer, val fld2: Integer)
>
> val lst1 = sc.parallelize(List(MyClass(1, 2), MyClass(1, 3))).toDF
> lst1.show
> lst1.write.mode(SaveMode.Overwrite).parquet("/mnt/lf/write-test/lst1.parquet")
Second cell:
> case class MyClass(val fld1: Integer, val fld2: Integer)
>
> val lst1 = sqlContext.read.parquet("/mnt/lf/write-test/lst1.parquet")
> val lst2 = sc.parallelize(List(MyClass(1, 4), MyClass(2, 3))).toDF
> lst1.registerTempTable("tbl1")
> lst2.registerTempTable("tbl2")
>
> // Keep the tbl1 rows whose key has no replacement in tbl2, then append all tbl2 rows.
> val sql = """
> SELECT t1.*
> FROM tbl1 t1
> LEFT JOIN tbl2 t2 ON t2.fld1 = t1.fld1
> WHERE t2.fld1 IS NULL
> UNION
> SELECT t2.*
> FROM tbl2 t2
> """
> val lst3 = sqlContext.sql(sql)
> lst3.show
> lst3.write.mode(SaveMode.Overwrite).parquet("/mnt/lf/write-test/lst1.parquet")
The idea is to update the saved DataFrame by replacing it with new content, where the new content is derived from the previously saved copy plus a new DataFrame. After executing the first cell, and then the second cell with its last line commented out, lst3.show displays the correct updated content.
However, an attempt to save lst1.parquet again throws an exception:
org.apache.spark.SparkException: Job aborted.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 48.0 failed 1 times, most recent failure: Lost task 0.0 in stage 48.0 (TID 1779, localhost): java.io.FileNotFoundException: /mnt/lf/save-test/lst1.parquet/part-r-00000-a119b6a9-64a6-4ba7-ba87-ad24341f7eea.gz.parquet
  at com.databricks.backend.daemon.data.client.DbfsClient.send0(DbfsClient.scala:65)
  at com.databricks.backend.daemon.data.client.DbfsClient.sendIdempotent(DbfsClient.scala:42)
...
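The only workaround I can think of so far is to write the result to a temporary path and then move it over the original. A rough sketch of that idea (assuming a Databricks environment where dbutils.fs is available; the temporary path name is my own invention):
> // Workaround sketch: avoid deleting the files lst3 is still reading from by
> // writing to a temporary location first, then swapping it into place.
> val target = "/mnt/lf/write-test/lst1.parquet"
> val tmp = target + ".tmp" // hypothetical temporary path
> lst3.write.mode(SaveMode.Overwrite).parquet(tmp)
> dbutils.fs.rm(target, recurse = true)
> dbutils.fs.mv(tmp, target, recurse = true)
Still, I'd rather understand why the direct overwrite of the same path fails.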
I appreciate any help.
Thanks.