Databricks overwrite didn't delete previous data
10-16-2024 01:39 AM
Hi Databricks, we ran into the issue shown in the picture below.
We use the PySpark API to write data to ADLS:
df.write.partitionBy("xx").option("partitionOverwriteMode","dynamic").mode("overwrite").parquet(xx)
However, we are not sure why the previous data still exists after we overwrote this partition a second time on 2024-09-26 at 4:29 PM.
The last commit log, "_committed_3404689632661433446", shows the following: what was removed was not the data from tid 3175486376768535369, which ran on 2024-09-26 at 4:23 PM, but the data file from tid 2405413862834130470, which ran on 2024-09-20.
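To see which transaction IDs still have files in the partition, a listing of the folder could be grouped by tid along these lines. This is only a rough sketch: it assumes the data file names embed the transaction ID as `tid-<digits>`, the helper name `group_files_by_tid` is made up, and the file names in the example are hypothetical, not our real files.

```python
import re
from collections import defaultdict

# Assumption: data file names embed the transaction ID as "tid-<digits>".
TID_PATTERN = re.compile(r"tid-(\d+)")


def group_files_by_tid(file_names):
    """Map each transaction ID to the file names that carry it."""
    groups = defaultdict(list)
    for name in file_names:
        match = TID_PATTERN.search(name)
        if match:  # names without a tid (e.g. _SUCCESS) are skipped
            groups[match.group(1)].append(name)
    return dict(groups)


# Hypothetical file names, for illustration only.
example = [
    "part-00000-tid-3175486376768535369-a.snappy.parquet",
    "part-00001-tid-3175486376768535369-b.snappy.parquet",
    "part-00000-tid-3404689632661433446-c.snappy.parquet",
    "_SUCCESS",
]

for tid, names in group_files_by_tid(example).items():
    print(tid, len(names))
```

Any tid that shows up here but does not belong to the latest overwrite would point to files left behind by an earlier run.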
Does anyone know the root cause, and how can we remove the data that should already have been deleted? Thanks!