
Databricks overwrite didn't delete previous data

Hi Databricks, we ran into the issue shown in the picture below:

_0-1729067185207.png

We use the PySpark API to store data in ADLS:

df.write.partitionBy("xx").option("partitionOverwriteMode","dynamic").mode("overwrite").parquet(xx)
However, we are not sure why the previous data still exists after we overwrote this partition a second time on 2024-09-26 at 4:29 PM.
 
The last commit log, "_committed_3404689632661433446", is shown below. What it removed was not the data file from tid 3175486376768535369, which ran on 2024-09-26 at 4:23 PM, but the data file from tid 2405413862834130470, which ran on 2024-09-20.
_1-1729067620519.png
 
Does anyone know the root cause, and how to remove the data that should already have been deleted? Thanks!
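
As a diagnostic aid for the question above, here is a minimal sketch that reads every _committed_<tid> marker in one partition directory and reports, per data file, whether a commit added it, a later commit removed it, or no commit mentions it. It assumes each marker is a JSON object with "added" and "removed" lists of file names, and that the directory is reachable as a local path (for example through a mount); both are assumptions, not something confirmed in this thread. The function name classify_files is hypothetical. It only reports and never deletes anything.

```python
import json
from pathlib import Path


def classify_files(partition_dir):
    """Map each data file in partition_dir to 'live', 'removed' or 'uncommitted'.

    Assumption: every _committed_<tid> marker holds a JSON object with
    optional "added" and "removed" lists of file names.
    """
    partition_dir = Path(partition_dir)
    added, removed = set(), set()

    for marker in partition_dir.glob("_committed_*"):
        try:
            entry = json.loads(marker.read_text())
        except (OSError, ValueError):
            continue  # skip unreadable, empty, or non-JSON markers
        if not isinstance(entry, dict):
            continue
        added.update(entry.get("added") or [])
        removed.update(entry.get("removed") or [])

    status = {}
    for path in partition_dir.iterdir():
        # Names starting with "_" are markers (_started_, _committed_, _SUCCESS).
        if not path.is_file() or path.name.startswith("_"):
            continue
        if path.name in removed:
            status[path.name] = "removed"      # a commit removed it, yet it is still on disk
        elif path.name in added:
            status[path.name] = "live"         # committed and never removed
        else:
            status[path.name] = "uncommitted"  # no commit marker mentions it
    return status
```

A file from the 4:23 PM run that shows up as "live" would confirm that no later commit listed it as removed, which matches what the commit log above suggests.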

 

 

3 REPLIES

Himanshu6
New Contributor

Ensure correct partition column values: double-check that the values in your partition column "xx" are consistent across the dataset, with no formatting differences or null values.
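
One way to run that check without reading the data is to inspect the partition directory names themselves. The sketch below is only an illustration: it takes a plain list of directory names (how you list them on ADLS is up to you), flags the __HIVE_DEFAULT_PARTITION__ directory that Spark uses for null partition values, and flags values that differ only by case, surrounding whitespace, or percent-escaping. The function name check_partition_dirs is hypothetical.

```python
from collections import defaultdict
from urllib.parse import unquote

# Directory value Spark writes when the partition column is null.
HIVE_NULL = "__HIVE_DEFAULT_PARTITION__"


def check_partition_dirs(dir_names, column="xx"):
    """Return a list of human-readable issues found in partition directory names."""
    prefix = column + "="
    issues = []
    by_normalized = defaultdict(list)

    for name in dir_names:
        if not name.startswith(prefix):
            continue  # not a partition directory for this column
        value = name[len(prefix):]
        if value == HIVE_NULL:
            issues.append(f"null partition value: {name}")
        # Group values that are equal once unescaped, trimmed and lower-cased.
        by_normalized[unquote(value).strip().lower()].append(name)

    for variants in by_normalized.values():
        if len(variants) > 1:
            issues.append("inconsistent spellings: " + ", ".join(sorted(variants)))
    return issues
```

An empty result means the directory names alone show no null or near-duplicate partition values; it does not rule out other causes.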

Himanshu Verma

Himanshu6
New Contributor

To clarify further: in Delta Lake, writing or overwriting data creates a new version of the table. When you query the table you see only the new data, but when you look at the storage location, both the old and the latest Parquet files are still present.

If your goal is to write data to ADLS and use the Parquet files directly, use .format('parquet').

Himanshu Verma

Panda
Contributor II

@阳光彩虹小白马

The issue you're encountering seems to involve inconsistent behavior in partition overwrites using PySpark with ADLS.

Can you validate the following, along with what @Himanshu6 mentioned?

  1. Force Spark to refresh the metadata of the data lake directory.
  2. Ensure that partitionOverwriteMode is set correctly (dynamic, in your case) before executing the overwrite operation.
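
The two checks above could be combined roughly as follows. This is a sketch under assumptions rather than a confirmed fix: spark is an existing SparkSession, df is the DataFrame being written, and the helper name overwrite_partitions is hypothetical. It sets spark.sql.sources.partitionOverwriteMode at session level before the write, repeats the per-write option from the original post, and then calls spark.catalog.refreshByPath to invalidate cached metadata for that path.

```python
def overwrite_partitions(spark, df, path, partition_col):
    """Overwrite only the partitions present in df, then refresh cached metadata.

    spark: an active SparkSession; df: the DataFrame to write (both assumed).
    Returns the session-level overwrite mode so the caller can verify it.
    """
    # Session-level setting, applied before the write is triggered.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    (
        df.write
        .partitionBy(partition_col)
        .option("partitionOverwriteMode", "dynamic")  # per-write option, as in the question
        .mode("overwrite")
        .parquet(path)
    )

    # Invalidate cached data and file listings for anything that reads this path.
    spark.catalog.refreshByPath(path)

    return spark.conf.get("spark.sql.sources.partitionOverwriteMode")
```

Checking the returned value is a cheap way to confirm the setting actually took effect in the session that ran the write.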
