<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: drop duplicate in 500B records in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/drop-duplicate-in-500b-records/m-p/86091#M37291</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/82150"&gt;@ImAbhishekTomar&lt;/a&gt;, how are you doing today?&lt;/P&gt;&lt;P&gt;To speed up your job, try repartitioning the DataFrame by the columns you're dropping duplicates on before calling dropDuplicates. You could also checkpoint the DataFrame to truncate its lineage. If that doesn't help, consider a groupBy aggregation instead of dropDuplicates, or optimize your Delta tables with Z-ordering. Lastly, make sure your cluster has enough resources to handle the load.&lt;/P&gt;&lt;P&gt;Give it a try and let me know if it works.&lt;/P&gt;&lt;P&gt;Good day.&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Brahma&lt;/P&gt;</description>
    <pubDate>Thu, 29 Aug 2024 02:45:52 GMT</pubDate>
    <dc:creator>Brahmareddy</dc:creator>
    <dc:date>2024-08-29T02:45:52Z</dc:date>
    <item>
      <title>drop duplicate in 500B records</title>
      <link>https://community.databricks.com/t5/data-engineering/drop-duplicate-in-500b-records/m-p/85962#M37288</link>
      <description>&lt;P&gt;I’m trying to drop duplicates in a DataFrame with 500B records, deleting based on multiple columns, but the process takes 5 hours. I’ve tried many suggestions from the internet, but nothing works for me.&lt;/P&gt;&lt;P&gt;My code looks like this:&lt;/P&gt;&lt;P&gt;df_1 = spark.read.format("delta").table("t1")  # 60M rows, 200 partitions&lt;BR /&gt;&lt;SPAN&gt;df_2 = spark.read.format("delta").table("t2")  # 8M rows, 160 partitions&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;df_join = df_1.join(broadcast(df_2), "city_code", "left")  # 500B rows, 300 partitions&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Up to this point the job takes only about 1 minute, but when I add the line below it takes 5 hours:&lt;/P&gt;&lt;P&gt;df_clean = df_join.dropDuplicates(["col1", "col2", "col3"])&lt;/P&gt;</description>
      <pubDate>Wed, 28 Aug 2024 20:42:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/drop-duplicate-in-500b-records/m-p/85962#M37288</guid>
      <dc:creator>ImAbhishekTomar</dc:creator>
      <dc:date>2024-08-28T20:42:16Z</dc:date>
    </item>
    <item>
      <title>Re: drop duplicate in 500B records</title>
      <link>https://community.databricks.com/t5/data-engineering/drop-duplicate-in-500b-records/m-p/86091#M37291</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/82150"&gt;@ImAbhishekTomar&lt;/a&gt;, how are you doing today?&lt;/P&gt;&lt;P&gt;To speed up your job, try repartitioning the DataFrame by the columns you're dropping duplicates on before calling dropDuplicates. You could also checkpoint the DataFrame to truncate its lineage. If that doesn't help, consider a groupBy aggregation instead of dropDuplicates, or optimize your Delta tables with Z-ordering. Lastly, make sure your cluster has enough resources to handle the load.&lt;/P&gt;&lt;P&gt;Give it a try and let me know if it works.&lt;/P&gt;&lt;P&gt;Good day.&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Brahma&lt;/P&gt;</description>
      <pubDate>Thu, 29 Aug 2024 02:45:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/drop-duplicate-in-500b-records/m-p/86091#M37291</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2024-08-29T02:45:52Z</dc:date>
    </item>
    <item>
      <title>Re: drop duplicate in 500B records</title>
      <link>https://community.databricks.com/t5/data-engineering/drop-duplicate-in-500b-records/m-p/86655#M37326</link>
      <description>&lt;P&gt;Drop the duplicates from df_1 and df_2 first, and then do the join.&lt;BR /&gt;If the join key is just city_code, then you most likely know which rows in df_1 and df_2 will produce the duplicates in df_join. So drop them in df_1 and in df_2 instead of in df_join.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Aug 2024 20:48:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/drop-duplicate-in-500b-records/m-p/86655#M37326</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-08-29T20:48:43Z</dc:date>
    </item>
  </channel>
</rss>

