Hi Databricks, we ran into an issue, as shown in the picture below:
We use the PySpark API to write data to ADLS:
df.write.partitionBy("xx").option("partitionOverwriteMode","dynamic").mode("overwrite").parquet(xx)
However, when we overwrote this partition a second time on 2024-09-26 at 4:29 PM, the previous data was still there, and we are not sure why...
The latest commit log, "_committed_3404689632661433446", is shown below: the files it removed do not belong to tid 3175486376768535369 (the write that ran on 2024-09-26 at 4:23 PM); instead, it removed the data file from tid 2405413862834130470, which ran on 2024-09-20...
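In case it helps anyone reproduce the check, this is roughly how the commit marker can be inspected. It assumes the _committed_<tid> file is a JSON object with "added"/"removed" lists of data-file names, which is what the screenshot appears to show (the exact schema is an internal detail of the Databricks commit protocol), and the mount path here is hypothetical:

```python
# Quick sketch: read a _committed_<tid> marker and print what it added/removed.
# The path and the assumed JSON layout ("added"/"removed" lists) are assumptions.
import json

commit_path = "/dbfs/mnt/target/_committed_3404689632661433446"  # hypothetical mount path

with open(commit_path) as f:
    commit = json.load(f)

print("Files added by this commit:")
for name in commit.get("added", []):
    print("  ", name)

print("Files this commit marked as removed:")
for name in commit.get("removed", []):
    print("  ", name)
```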
Does anyone know the root cause, and how to remove the data that should already have been deleted? Thanks!