Databricks Community

阳光彩虹小白马 · ‎10-16-2024

Hi Databricks, we ran into an issue, shown in the picture below:

We use the PySpark API to store data in ADLS:

df.write.partitionBy("xx").option("partitionOverwriteMode","dynamic").mode("overwrite").parquet(xx)

However, we are not sure why the previous data still exists after we overwrote this partition a second time, on 2024-09-26 4:29 PM.

The last commit log, "_committed_3404689632661433446", is shown below. What it removed is not tid 3175486376768535369, which was run on 2024-09-26 4:23 PM; what it removed was the data file from tid 2405413862834130470, which was run on 2024-09-20.

Does anyone know the root cause, and how to remove the data that should already have been deleted? Thanks!
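For anyone reproducing this check, a minimal sketch of how the added and removed tids could be read from a commit marker. It assumes the "_committed_<tid>" file is JSON with "added" and "removed" lists of part-file names that embed "tid-<number>". The helper name summarize_commit and the sample file names are hypothetical; only the tids come from the post above.

```python
import json
import re

# Assumption: part-file names embed the transaction id as "tid-<digits>".
TID_PATTERN = re.compile(r"tid-(\d+)")


def extract_tids(file_names):
    """Collect the distinct tids found in a list of part-file names."""
    found = set()
    for name in file_names:
        match = TID_PATTERN.search(name)
        if match:
            found.add(match.group(1))
    return sorted(found)


def summarize_commit(marker_text):
    """Return the tids of the files a commit marker added and removed."""
    marker = json.loads(marker_text)
    return {
        "added": extract_tids(marker.get("added", [])),
        "removed": extract_tids(marker.get("removed", [])),
    }


# Hypothetical marker content; the file names are made up for illustration.
sample = json.dumps({
    "added": ["part-00000-tid-3404689632661433446-aaaa-1-c000.snappy.parquet"],
    "removed": ["part-00000-tid-2405413862834130470-bbbb-1-c000.snappy.parquet"],
})
print(summarize_commit(sample))
```

If the tid of the 4:23 PM run never shows up under "removed" in any later marker, that would be consistent with the behavior described above.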

Himanshu6 · ‎10-17-2024

Ensure correct partition column values: double-check that the values in your partition column "xx" are consistent across the dataset. Make sure there are no formatting issues or null values.

Himanshu Verma

Himanshu6 · ‎10-17-2024

To clarify further: in Delta Lake, writing or overwriting data creates a new version of the table. If you query the table you will see the new data, but if you check the storage location, both the old and the latest Parquet files will be present.

If your goal is to write data to ADLS and use the Parquet files directly, then use .format('parquet').

Himanshu Verma

Panda · ‎10-17-2024

@阳光彩虹小白马

The issue you're encountering seems to involve inconsistent behavior in partition overwrites using PySpark with ADLS.

Can you validate the points below, along with what @Himanshu6 mentioned?

Force Spark to refresh the metadata of the data lake directory.
Ensure that the overwrite mode (partitionOverwriteMode) is set properly before executing the overwrite operation.
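The two checks above could be sketched roughly as follows. This is an illustration, not a confirmed fix for the issue: the function name overwrite_partitions and its parameters are hypothetical, while spark.sql.sources.partitionOverwriteMode and spark.catalog.refreshByPath are standard Spark settings and calls. It assumes an existing SparkSession (spark) and DataFrame (df).

```python
def overwrite_partitions(spark, df, path, partition_col):
    """Overwrite only the partitions present in df, then re-read the path.

    spark is an existing SparkSession and df an existing DataFrame;
    path and partition_col are placeholders for your own values.
    """
    # Set the mode at session level before the write, so it does not
    # depend on the per-write option alone.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    (
        df.write
        .partitionBy(partition_col)
        .option("partitionOverwriteMode", "dynamic")
        .mode("overwrite")
        .parquet(path)
    )

    # Invalidate cached metadata and file listings for this location,
    # so the next read reflects the files currently in storage.
    spark.catalog.refreshByPath(path)
    return spark.read.parquet(path)
```

A call would look like overwrite_partitions(spark, df, xx, "xx"), reusing the placeholders from the original snippet. Comparing the row count of the returned DataFrame for the overwritten partition against the source DataFrame is one way to tell whether stale files are still being read.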


Databricks overwrite didn't delete previous data
