<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks overwrite didn't delete previous data in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-overwrite-didn-t-delete-previous-data/m-p/94419#M38904</link>
    <description>&lt;P&gt;To clarify: in Delta Lake, every write or overwrite creates a new table version. Querying the table returns only the new data, but the storage location still contains both the old and the latest Parquet files.&lt;BR /&gt;&lt;BR /&gt;If your goal is to write data to ADLS and consume the Parquet files directly, use .format('parquet').&lt;/P&gt;</description>
    <pubDate>Thu, 17 Oct 2024 08:19:54 GMT</pubDate>
    <dc:creator>Himanshu6</dc:creator>
    <dc:date>2024-10-17T08:19:54Z</dc:date>
    <item>
      <title>Databricks overwrite didn't delete previous data</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-overwrite-didn-t-delete-previous-data/m-p/94259#M38848</link>
      <description>&lt;P&gt;Hi Databricks, we ran into the issue shown in the picture below:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="_0-1729067185207.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11981iDC4649BE270B398A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="_0-1729067185207.png" alt="_0-1729067185207.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;We use the PySpark API to store data in ADLS:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;df.write.partitionBy("xx").option("partitionOverwriteMode", "dynamic").mode("overwrite").parquet(xx)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;However, we are not sure why the previous data still exists after we overwrote this partition a second time on &lt;STRONG&gt;2024-09-26 4:29 PM&lt;/STRONG&gt;.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;The last commit log, "_committed_3404689632661433446", is shown below. What was removed is not the data from tid 3175486376768535369, which ran on &lt;STRONG&gt;2024-09-26 4:23 PM&lt;/STRONG&gt;, but the data file from tid 2405413862834130470, which ran on &lt;STRONG&gt;2024-09-20&lt;/STRONG&gt;.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="_1-1729067620519.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11982i696A2D9E8671BC28/image-size/medium?v=v2&amp;amp;px=400" role="button" title="_1-1729067620519.png" alt="_1-1729067620519.png" /&gt;&lt;/span&gt;&lt;/DIV&gt;&lt;DIV&gt;Does anyone know the root cause, and how to remove the data that should already have been deleted? Thanks!&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 16 Oct 2024 08:39:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-overwrite-didn-t-delete-previous-data/m-p/94259#M38848</guid>
      <dc:creator>阳光彩虹小白马</dc:creator>
      <dc:date>2024-10-16T08:39:34Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks overwrite didn't delete previous data</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-overwrite-didn-t-delete-previous-data/m-p/94410#M38901</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Ensure correct partition column values:&lt;/STRONG&gt; Double-check that the values in your partition column "xx" are consistent across the dataset, with no formatting issues or null values.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Oct 2024 07:06:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-overwrite-didn-t-delete-previous-data/m-p/94410#M38901</guid>
      <dc:creator>Himanshu6</dc:creator>
      <dc:date>2024-10-17T07:06:54Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks overwrite didn't delete previous data</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-overwrite-didn-t-delete-previous-data/m-p/94419#M38904</link>
      <description>&lt;P&gt;To clarify: in Delta Lake, every write or overwrite creates a new table version. Querying the table returns only the new data, but the storage location still contains both the old and the latest Parquet files.&lt;BR /&gt;&lt;BR /&gt;If your goal is to write data to ADLS and consume the Parquet files directly, use .format('parquet').&lt;/P&gt;</description>
      <pubDate>Thu, 17 Oct 2024 08:19:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-overwrite-didn-t-delete-previous-data/m-p/94419#M38904</guid>
      <dc:creator>Himanshu6</dc:creator>
      <dc:date>2024-10-17T08:19:54Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks overwrite didn't delete previous data</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-overwrite-didn-t-delete-previous-data/m-p/94455#M38908</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/127321"&gt;@阳光彩虹小白马&lt;/a&gt;&lt;/P&gt;&lt;P&gt;The issue you're encountering seems to involve inconsistent behavior in partition overwrites when using PySpark with ADLS.&lt;BR /&gt;&lt;BR /&gt;Can you check the following, along with what&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/127051"&gt;@Himanshu6&lt;/a&gt;&amp;nbsp;mentioned?&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Force Spark to refresh the metadata of the data lake directory.&lt;/LI&gt;&lt;LI&gt;Ensure that the partition overwrite mode (partitionOverwriteMode) is set properly before executing the overwrite operation.&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Thu, 17 Oct 2024 09:24:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-overwrite-didn-t-delete-previous-data/m-p/94455#M38908</guid>
      <dc:creator>Panda</dc:creator>
      <dc:date>2024-10-17T09:24:52Z</dc:date>
    </item>
  </channel>
</rss>

