<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-get-all-occurrences-of-duplicate-records-in-a-pyspark/m-p/19821#M13350</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;In my experience, if you use dropDuplicates(), Spark keeps an arbitrary row from each duplicate group.&lt;/P&gt;&lt;P&gt;Therefore, you should define explicit logic to decide which duplicated rows to keep or remove.&lt;/P&gt;</description>
    <pubDate>Wed, 30 Nov 2022 09:30:19 GMT</pubDate>
    <dc:creator>NhatHoang</dc:creator>
    <dc:date>2022-11-30T09:30:19Z</dc:date>
    <item>
      <title>How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-all-occurrences-of-duplicate-records-in-a-pyspark/m-p/19818#M13347</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I need to find all occurrences of duplicate records in a PySpark DataFrame. Following is the sample dataset:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# Prepare Data
data = [("A", "A", 1), \
    ("A", "A", 2), \
    ("A", "A", 3), \
    ("A", "B", 4), \
    ("A", "B", 5), \
    ("A", "C", 6), \
    ("A", "D", 7), \
    ("A", "E", 8), \
  ]
&amp;nbsp;
# Create DataFrame
columns= ["col_1", "col_2", "col_3"]
df = spark.createDataFrame(data = data, schema = columns)
df.printSchema()
df.show(truncate=False)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1091i0B326CF1D4597464/image-size/large?v=v2&amp;amp;px=999" alt="df.show() output" /&gt;&lt;/P&gt;&lt;P&gt;When I try the following code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;primary_key = ['col_1', 'col_2']

duplicate_records = df.exceptAll(df.dropDuplicates(primary_key))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The output is:&lt;/P&gt;&lt;P&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1090i4DFDD526F6518CCD/image-size/large?v=v2&amp;amp;px=999" alt="exceptAll output" /&gt;&lt;/P&gt;&lt;P&gt;As you can see, I don't get all occurrences of the duplicate records for the primary key: df.dropDuplicates(primary_key) retains one row from each duplicate group, and exceptAll removes exactly those retained rows. The 1st and the 4th records of the dataset should also appear in the output.&lt;/P&gt;&lt;P&gt;Any idea how to solve this?&lt;/P&gt;
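&lt;P&gt;To make the failure mode concrete (a sketch; which rows dropDuplicates() keeps is not deterministic):&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# If dropDuplicates(primary_key) happens to keep ("A", "A", 1) and ("A", "B", 4),
# exceptAll removes exactly those rows and returns only:
#   ("A", "A", 2), ("A", "A", 3), ("A", "B", 5)
# whereas the desired result is all five rows of the duplicate
# groups ("A", "A") and ("A", "B").&lt;/CODE&gt;&lt;/PRE&gt;</description>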
      <pubDate>Wed, 30 Nov 2022 06:55:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-all-occurrences-of-duplicate-records-in-a-pyspark/m-p/19818#M13347</guid>
      <dc:creator>Mado</dc:creator>
      <dc:date>2022-11-30T06:55:53Z</dc:date>
    </item>
    <item>
      <title>Re: How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-all-occurrences-of-duplicate-records-in-a-pyspark/m-p/19819#M13348</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Getting the non-duplicated records and then doing a 'left_anti' join should do the trick.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# Keys that occur exactly once are, by definition, not duplicated
not_duplicate_records = df.groupBy(primary_key).count().where('count = 1').drop('count')

# Anti-join keeps only rows whose key is NOT in the unique-key set
duplicate_records = df.join(not_duplicate_records, on=primary_key, how='left_anti')
duplicate_records.show()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1087iEFB1F20FB256783D/image-size/large?v=v2&amp;amp;px=999" alt="duplicate records output" /&gt;&lt;/P&gt;
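&lt;P&gt;Equivalently (a minor variation, not from the original post), you can select the duplicated keys directly and keep their rows with a 'left_semi' join:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# Keys that occur more than once
dup_keys = df.groupBy(primary_key).count().where('count &amp;gt; 1').drop('count')

# Semi-join keeps only rows whose key IS in the duplicated-key set
duplicate_records = df.join(dup_keys, on=primary_key, how='left_semi')
duplicate_records.show()&lt;/CODE&gt;&lt;/PRE&gt;</description>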
      <pubDate>Wed, 30 Nov 2022 07:26:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-all-occurrences-of-duplicate-records-in-a-pyspark/m-p/19819#M13348</guid>
      <dc:creator>daniel_sahal</dc:creator>
      <dc:date>2022-11-30T07:26:49Z</dc:date>
    </item>
    <item>
      <title>Re: How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-all-occurrences-of-duplicate-records-in-a-pyspark/m-p/19820#M13349</link>
      <description>&lt;P&gt;@Mohammad Saber, how about using a window function like the one below?&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import Window
import pyspark.sql.functions as F

windowSpec = Window.partitionBy(*primary_key)

df.withColumn("primary_key_count", F.count("*").over(windowSpec)).filter(F.col("primary_key_count") &amp;gt; 1).drop("primary_key_count").show(truncate=False)&lt;/CODE&gt;&lt;/PRE&gt;
      <pubDate>Wed, 30 Nov 2022 08:44:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-all-occurrences-of-duplicate-records-in-a-pyspark/m-p/19820#M13349</guid>
      <dc:creator>Shalabh007</dc:creator>
      <dc:date>2022-11-30T08:44:00Z</dc:date>
    </item>
    <item>
      <title>Re: How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-all-occurrences-of-duplicate-records-in-a-pyspark/m-p/19821#M13350</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;In my experience, if you use dropDuplicates(), Spark keeps an arbitrary row from each duplicate group.&lt;/P&gt;&lt;P&gt;Therefore, you should define explicit logic to decide which duplicated rows to keep or remove.&lt;/P&gt;
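&lt;P&gt;For example (a minimal sketch, assuming you want to keep the row with the smallest col_3 in each group), a row_number() window makes the choice deterministic:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import Window
import pyspark.sql.functions as F

# Rank rows within each primary-key group by col_3 (the assumed tiebreaker)
w = Window.partitionBy(*primary_key).orderBy(F.col("col_3").asc())

# Keep exactly one row per group -- the one ranked first
deduped = df.withColumn("rn", F.row_number().over(w)).filter(F.col("rn") == 1).drop("rn")&lt;/CODE&gt;&lt;/PRE&gt;</description>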
      <pubDate>Wed, 30 Nov 2022 09:30:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-all-occurrences-of-duplicate-records-in-a-pyspark/m-p/19821#M13350</guid>
      <dc:creator>NhatHoang</dc:creator>
      <dc:date>2022-11-30T09:30:19Z</dc:date>
    </item>
  </channel>
</rss>

