Get Started Discussions

drop duplicates within watermark

aerofish
New Contributor III

We recently started using Structured Streaming to ingest data, and we want to use a watermark to drop duplicate events. However, we have run into some weird behavior and an unexpected exception. Can anyone explain what the expected behavior is and how these methods should be used correctly?

I have four scenarios:

  1. Ingest from JSON files to a Delta table, using withWatermark + dropDuplicates
    Behavior: it drops all duplicates within the watermark, and it also drops all events (not only duplicates) older than the watermark. Is this the expected behavior?
  2. Ingest from a Delta table to a Delta table, using withWatermark + dropDuplicates
    Behavior: it drops all duplicates within the watermark, and it also drops duplicate events older than the watermark.
  3. Ingest from a Delta table to a Delta table, using withWatermark + dropDuplicatesWithinWatermark
    Behavior: I tested the newly introduced method, dropDuplicatesWithinWatermark. Every time, it throws java.util.NoSuchElementException: None.get, which is a very generic exception. Can anyone explain why a basic call to dropDuplicatesWithinWatermark produces this error?
  4. Ingest from JSON files to a Delta table, using withWatermark + dropDuplicatesWithinWatermark
    Behavior: it drops duplicates within the watermark, and it also drops every event older than the watermark. So the behavior differs from the 3rd scenario (same method, but Delta table to Delta table).
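
For reference, the two call patterns used in the scenarios above look roughly like this in PySpark. This is only a sketch: the column names `event_id` and `event_time`, the `10 minutes` delay, and the helper function names are placeholders I made up, not the actual job. The helpers just chain the DataFrame methods, so they work on any streaming DataFrame that has those columns. Note that `dropDuplicatesWithinWatermark` only exists from Spark 3.5 onward.

```python
# Sketch only: "event_id", "event_time" and the delay are hypothetical names/values.

def dedup_with_drop_duplicates(events, delay="10 minutes"):
    """withWatermark + dropDuplicates (scenarios 1 and 2).

    The event-time column is included in the key list; the Structured
    Streaming guide recommends this so that old deduplication state can
    be cleaned up once the watermark passes it.
    """
    return (
        events
        .withWatermark("event_time", delay)
        .dropDuplicates(["event_id", "event_time"])
    )


def dedup_within_watermark(events, delay="10 minutes"):
    """withWatermark + dropDuplicatesWithinWatermark (scenarios 3 and 4).

    Only the identifying column is used as the key, so duplicates whose
    event times differ slightly can still be matched, as long as they
    arrive within the watermark delay.
    """
    return (
        events
        .withWatermark("event_time", delay)
        .dropDuplicatesWithinWatermark(["event_id"])
    )
```

In a real job, `events` would come from something like `spark.readStream.format("delta")` or a JSON file source, and the result would be written with `writeStream`.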

Should I use dropDuplicatesWithinWatermark? It throws an exception when ingesting from a Delta table to a Delta table. Is that a bug?
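
To make the "everything older than the watermark gets dropped" observation from scenarios 1 and 4 concrete, here is a toy pure-Python model of watermark-based deduplication. It is not Spark's implementation and it does not explain the None.get exception. It only assumes a simplified rule set: events at or before the current watermark are treated as late and discarded, a key is remembered until its first event time plus the delay, and the watermark advances after each batch to the largest event time seen minus the delay. The class name and integer timestamps are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeduplicator:
    """Toy model of dedup under a watermark (assumed, simplified semantics)."""
    delay: int                       # watermark delay, in seconds
    watermark: int = 0               # current watermark (event-time seconds)
    max_event_time: int = 0          # largest event time observed so far
    seen: dict = field(default_factory=dict)  # key -> state expiry time

    def process_batch(self, events):
        """Process one micro-batch of (key, event_time) pairs."""
        emitted = []
        for key, ts in events:
            # Every row contributes to the maximum observed event time.
            self.max_event_time = max(self.max_event_time, ts)
            if ts <= self.watermark:
                continue             # late event: dropped even if not a duplicate
            if key in self.seen:
                continue             # duplicate of a key still held in state
            self.seen[key] = ts + self.delay
            emitted.append((key, ts))
        # The watermark moves only after the batch and never goes backwards.
        self.watermark = max(self.watermark, self.max_event_time - self.delay)
        # Evict state the watermark has passed.
        self.seen = {k: exp for k, exp in self.seen.items() if exp > self.watermark}
        return emitted


dedup = ToyDeduplicator(delay=10)
first = dedup.process_batch([("a", 100), ("a", 101), ("b", 105)])
second = dedup.process_batch([("c", 90), ("a", 103), ("d", 120)])
third = dedup.process_batch([("a", 115)])
```

In this model the first batch emits `a` and `b` once each, the second batch drops `c` purely because it is older than the watermark (it was never a duplicate), and the third batch emits `a` again because its state has already been evicted.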

Thanks!

3 REPLIES

Max_Liu
New Contributor II

I can confirm that we are getting the same error in case 3:
Ingest from a Delta table to a Delta table, using withWatermark + dropDuplicatesWithinWatermark
Behavior: every call to the newly introduced dropDuplicatesWithinWatermark method throws java.util.NoSuchElementException: None.get, which is a very generic exception.

aerofish
New Contributor III

Thanks for sharing your experience!

Waiting for more explanation and solutions...

aerofish
New Contributor III

Can any maintainer help with this question?
