Data Engineering

Forum Posts

Mado (Valued Contributor II) • 20168 Views • 3 replies • 10 kudos

Resolved! How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?

Hi, I need to find all occurrences of duplicate records in a PySpark DataFrame. Following is the sample dataset:

# Prepare data
data = [("A", "A", 1),
        ("A", "A", 2),
        ("A", "A", 3),
        ("A", "B", 4),
        ("A", "B", 5),
        ("A", "C", ...

Latest Reply: NhatHoang (Valued Contributor II) • 10 kudos

Hi, in my experience, if you use dropDuplicates(), Spark keeps an arbitrary row per key. Therefore, you should define your own logic to handle the duplicated rows.
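One minimal sketch of such logic, using a window count over the key columns so that every occurrence of a duplicate is returned (the column names col1, col2, id and the last sample row are assumptions based on the truncated data in the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data shaped like the question's dataset (column names are assumed)
data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3),
        ("A", "B", 4), ("A", "B", 5), ("A", "C", 6)]
df = spark.createDataFrame(data, ["col1", "col2", "id"])

# Count rows per (col1, col2) key; every row whose key count > 1 is kept
w = Window.partitionBy("col1", "col2")
all_duplicates = (df.withColumn("cnt", F.count("*").over(w))
                    .filter(F.col("cnt") > 1)
                    .drop("cnt"))

all_duplicates.show()
# ("A", "C", 6) is excluded because its key occurs only once; unlike
# dropDuplicates(), all other rows are kept, not just one per key.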

2 More Replies
Mado (Valued Contributor II) • 1618 Views • 2 replies • 3 kudos

Question about "foreachBatch" to remove duplicate records when streaming data

Hi, I am practicing with the Databricks sample notebooks published here: https://github.com/databricks-academy/advanced-data-engineering-with-databricks
In one of the notebooks (ADE 3.1 - Streaming Deduplication) (URL), there is a sample code to remove dupli...

Latest Reply: Anonymous (Not applicable) • 3 kudos

Hi @Mohammad Saber, great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; otherwise, bricksters will get back to you soon. Thanks!
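For readers landing here, the streaming-deduplication pattern that notebook teaches looks roughly like the sketch below: dropDuplicates with a watermark inside the stream, plus an insert-only MERGE in foreachBatch so rows that reappear across microbatches are not written twice. The table, column, and checkpoint names are assumptions, not the notebook's actual code (microbatch_df.sparkSession requires Spark 3.3+):

def upsert_to_delta(microbatch_df, batch_id):
    # Insert-only merge: keys already present in the target are skipped,
    # so a reprocessed microbatch cannot insert the same row twice.
    microbatch_df.createOrReplaceTempView("updates")
    microbatch_df.sparkSession.sql("""
        MERGE INTO silver_events t
        USING updates u
        ON t.user_id = u.user_id AND t.event_time = u.event_time
        WHEN NOT MATCHED THEN INSERT *
    """)

(spark.readStream
      .table("bronze_events")                     # assumed source table
      .withWatermark("event_time", "30 seconds")  # bound the dedup state
      .dropDuplicates(["user_id", "event_time"])  # dedup within the stream
      .writeStream
      .foreachBatch(upsert_to_delta)              # dedup against the target
      .option("checkpointLocation", "/tmp/checkpoints/dedup")
      .start())

The watermark bounds the state dropDuplicates must keep, and the checkpoint plus the insert-only merge together make the write idempotent across restarts.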

1 More Reply
User16869510359 (Esteemed Contributor) • 2043 Views • 1 reply • 0 kudos

Resolved! Can Spark JDBC create duplicate records

Is it transaction-safe? Does it ensure atomicity?

Latest Reply: User16869510359 (Esteemed Contributor) • 0 kudos

Atomicity is ensured at the task level, not at the stage level. If a stage is retried for any reason, tasks that already completed the write operation will re-run and cause duplicate records. This is expected by design. When Apache Spa...
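A common way to guard against this, sketched below with assumed connection details and table names: write each run to a staging table, then promote to the target with a single server-side transaction that collapses any retry duplicates.

from pyspark.sql import functions as F

# Assumed connection details; adjust for your database and driver.
jdbc_url = "jdbc:postgresql://db-host:5432/shop"
props = {"user": "etl", "password": "...", "driver": "org.postgresql.Driver"}

# Overwrite a staging table first; the target is only touched server-side.
(df.withColumn("run_id", F.lit("2024-01-15-run-01"))  # assumed run identifier
   .write
   .mode("overwrite")
   .jdbc(jdbc_url, "orders_staging", properties=props))

# Then promote in one database transaction, collapsing retry duplicates:
#   BEGIN;
#   INSERT INTO orders SELECT DISTINCT * FROM orders_staging;
#   COMMIT;

Because the target table is modified only by that one transactional statement, a stage retry can at worst leave extra copies in staging, which the SELECT DISTINCT removes.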
