Data Engineering

Forum Posts

Mado
by Valued Contributor II
  • 19520 Views
  • 3 replies
  • 10 kudos

Resolved! How to get all occurrences of duplicate records in a PySpark DataFrame based on specific columns?

Hi, I need to find all occurrences of duplicate records in a PySpark DataFrame. Following is the sample dataset: # Prepare Data data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3), ("A", "B", 4), ("A", "B", 5), ("A", "C", ...

Latest Reply
NhatHoang
Valued Contributor II
  • 10 kudos

Hi, in my experience, if you use dropDuplicates(), Spark keeps an arbitrary row from each group of duplicates. Therefore, you should define your own logic for deciding which duplicated rows to remove.
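
A minimal sketch of the window-count approach the thread points toward: count rows per key and keep every row whose key appears more than once, so all occurrences of a duplicate survive (unlike dropDuplicates(), which keeps only one). The column names col1/col2/col3 and the final row's value are assumptions, since the question's sample is truncated.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Reconstructed from the truncated excerpt; the last value (6) is a placeholder.
data = [("A", "A", 1), ("A", "A", 2), ("A", "A", 3),
        ("A", "B", 4), ("A", "B", 5), ("A", "C", 6)]
df = spark.createDataFrame(data, ["col1", "col2", "col3"])

# Count rows per (col1, col2); any row in a group larger than 1 is a duplicate occurrence.
w = Window.partitionBy("col1", "col2")
all_duplicates = (df
    .withColumn("cnt", F.count("*").over(w))
    .filter(F.col("cnt") > 1)
    .drop("cnt"))
all_duplicates.show()

With this sample, all of the ("A", "A") and ("A", "B") rows are returned, while the lone ("A", "C") row is filtered out.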

2 More Replies
Mado
by Valued Contributor II
  • 1571 Views
  • 2 replies
  • 3 kudos

Question about "foreachBatch" to remove duplicate records when streaming data

Hi, I am practicing with a Databricks sample notebook published here: https://github.com/databricks-academy/advanced-data-engineering-with-databricks. In one of the notebooks (ADE 3.1 - Streaming Deduplication) (URL), there is sample code to remove dupli...
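
The linked notebook's exact code is not reproduced here, but the pattern it teaches can be sketched under stated assumptions: deduplicate the stream with a watermark, then use foreachBatch to MERGE each micro-batch into a Delta table so replayed batches stay idempotent. Table names, key columns, and the checkpoint path below are placeholders, not the notebook's.

# Runs on the driver once per micro-batch; batch_id is supplied by Structured Streaming.
def upsert_batch(microbatch_df, batch_id):
    # Remove duplicates within this micro-batch on the business key.
    deduped = microbatch_df.dropDuplicates(["user_id", "event_time"])
    deduped.createOrReplaceTempView("batch_updates")
    # Insert only keys not already present, so re-processing a batch adds nothing.
    # DataFrame.sparkSession requires PySpark 3.3+; target_table is assumed to be Delta.
    microbatch_df.sparkSession.sql("""
        MERGE INTO target_table t
        USING batch_updates s
        ON t.user_id = s.user_id AND t.event_time = s.event_time
        WHEN NOT MATCHED THEN INSERT *
    """)

query = (spark.readStream.table("source_table")
         .withWatermark("event_time", "30 seconds")   # bound the state kept for streaming dedup
         .dropDuplicates(["user_id", "event_time"])   # dedup across batches within the watermark
         .writeStream
         .foreachBatch(upsert_batch)
         .option("checkpointLocation", "/tmp/dedup_checkpoint")
         .start())

The MERGE keeps the sink idempotent even when a micro-batch is re-run after a failure, which is the point of pairing foreachBatch with streaming deduplication.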

Latest Reply
Anonymous
Not applicable
  • 3 kudos

Hi @Mohammad Saber, great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise, Bricksters will get back to you soon. Thanks!

1 More Reply
User16869510359
by Esteemed Contributor
  • 1978 Views
  • 1 replies
  • 0 kudos

Resolved! Can Spark JDBC create duplicate records

Is it transaction-safe? Does it ensure atomicity?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Atomicity is ensured at the task level, not at the stage level. If the stage is retried for any reason, tasks that already completed the write operation will re-run and cause duplicate records. This is expected by design. When Apache Spa...
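
The reply is cut off, but the failure mode it describes (a stage retry re-running tasks that already wrote) suggests a common mitigation, sketched here as an assumption rather than the thread's answer: write to a disposable staging table over JDBC, then promote the rows on the database side in a single transaction, so Spark retries only ever touch the staging table.

# All connection details and table names are placeholders.
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://host:5432/db")
   .option("dbtable", "staging_table")   # disposable; safe to rewrite on retry
   .option("user", "user")
   .option("password", "pw")
   .mode("overwrite")
   .save())

# Then, in one transaction on the database itself:
#   INSERT INTO final_table SELECT DISTINCT * FROM staging_table;
# so a Spark-side retry can never leave partial or duplicated rows in final_table.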
