article High Throughput ‘Exactly Once’ Streaming from Google Pub/Sub with Structured Streaming on Databricks in Technical Blog

High Throughput ‘Exactly Once’ Streaming from Google Pub/Sub with Structured Streaming on Databricks

thewizard — Mon, 11 Dec 2023 17:48:51 GMT

Re: High Throughput ‘Exactly Once’ Streaming from Google Pub/Sub with Structured Streaming on Databr

aerofish — Fri, 07 Jun 2024 03:06:33 GMT

Thanks for the detailed blog! I'm testing a similar scenario. However, in production we have to add extra deduplication to achieve exactly-once delivery from the data sources through to the Delta table, because no data source publisher can guarantee exactly-once delivery. We therefore apply dropDuplicatesWithinWatermark on a unique event ID, as recommended in https://docs.gcp.databricks.com/en/structured-streaming/watermarks.html. In my tests, we lose events from time to time whenever dropDuplicatesWithinWatermark is enabled. Without dropDuplicatesWithinWatermark, I never lose events.

@thewizard: Have you tested such a scenario? Or do you know of anything in the design or implementation of dropDuplicatesWithinWatermark that could cause this issue? Thanks!

Re: High Throughput ‘Exactly Once’ Streaming from Google Pub/Sub with Structured Streaming on Databr

thewizard — Fri, 07 Jun 2024 07:35:08 GMT

I don't know enough about your configuration and use case to really answer this. There shouldn't be duplicates written by the Pub/Sub connector, as they are deduped in RocksDB. However, they are deduped using the messageId field, which may be different from your "unique event ID". There is no way for the Pub/Sub connector to dedupe on anything other than the messageId, so if duplicate records are coming in to Pub/Sub, you would still need to use dropDuplicatesWithinWatermark to get rid of them.
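For illustration, a minimal sketch of that extra deduplication step, assuming the stream has already been parsed into a DataFrame with an event-time column and a business ID column. The function name, the column names `event_time` and `event_id`, and the 10 minute delay are hypothetical placeholders, not values from this thread. `withWatermark` and `dropDuplicatesWithinWatermark` are the standard Structured Streaming DataFrame methods (the latter needs a runtime based on Spark 3.5 or later).

```python
def dedupe_on_event_id(events, event_time_col="event_time",
                       id_col="event_id", delay="10 minutes"):
    """Drop rows whose business ID repeats within the watermark delay.

    `events` is expected to be a streaming DataFrame. The column names and
    the delay are assumptions for this sketch; adjust them to your schema
    and to how late a retried publish can realistically arrive.
    """
    return (
        events
        # A watermark must be set first, otherwise Spark rejects
        # dropDuplicatesWithinWatermark on a streaming DataFrame.
        .withWatermark(event_time_col, delay)
        # Deduplicate on the business key rather than Pub/Sub's messageId.
        .dropDuplicatesWithinWatermark([id_col])
    )
```

The result can then be written to the Delta table with `writeStream` as usual. The delay bounds both how long deduplication state is kept and how far apart two copies of the same event may arrive and still be recognised as duplicates.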

Re: High Throughput ‘Exactly Once’ Streaming from Google Pub/Sub with Structured Streaming on Databr

aerofish — Tue, 11 Jun 2024 02:33:04 GMT

Hi @thewizard, thanks for your quick reply. Yes, I have to do the deduplication based on my business unique ID, because the message publisher (not Pub/Sub itself, nor the Databricks connector) can produce duplicates when it retries publishing. Therefore I'm relying on dropDuplicatesWithinWatermark to do the deduplication, and this method is causing message loss...

Re: High Throughput ‘Exactly Once’ Streaming from Google Pub/Sub with Structured Streaming on Databr

thewizard — Thu, 20 Jun 2024 09:21:06 GMT

Apologies for the late reply, I was at DAIS. You should not see a message dropped by dropDuplicatesWithinWatermark (except where it is a genuine duplicate, based on the specific keys). If the problem persists, I would raise a support ticket so that we can investigate.
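One thing that may be worth ruling out before raising a ticket (an assumption, not something confirmed in this thread): stateful operators in Structured Streaming generally discard rows whose event time is already behind the watermark, so a unique event that arrives later than the delay threshold could be dropped even though it is not a duplicate. The toy model below is plain Python, not Spark's implementation, and every name in it is hypothetical. It only illustrates how those two kinds of drop differ.

```python
from datetime import datetime, timedelta


def dedupe_within_watermark(batches, delay):
    """Toy model of watermark-bounded deduplication (NOT Spark's code).

    batches: list of micro-batches, each a list of (event_id, event_time).
    Returns (emitted_ids, late_ids).
    """
    watermark = datetime.min   # nothing counts as late before data is seen
    max_seen = None            # largest event time observed so far
    state = {}                 # event_id -> time at which its state expires
    emitted, late = [], []
    for batch in batches:
        for event_id, event_time in batch:
            if event_time < watermark:
                late.append(event_id)      # behind the watermark: dropped
                continue
            if event_id in state:
                continue                   # genuine duplicate: dropped
            state[event_id] = event_time + delay
            emitted.append(event_id)
        # The watermark only moves between micro-batches in this model.
        for _, event_time in batch:
            if max_seen is None or event_time > max_seen:
                max_seen = event_time
        if max_seen is not None:
            watermark = max_seen - delay
        # Evict state that can no longer match anything that is not late.
        state = {k: exp for k, exp in state.items() if exp > watermark}
    return emitted, late


if __name__ == "__main__":
    t0 = datetime(2024, 6, 7, 3, 0, 0)
    minute = timedelta(minutes=1)
    batches = [
        [("a", t0), ("a", t0 + 1 * minute)],  # second "a" is a duplicate
        [("b", t0 + 30 * minute)],            # moves watermark to t0 + 20 min
        [("c", t0 + 5 * minute)],             # unique, but behind the watermark
    ]
    print(dedupe_within_watermark(batches, delay=10 * minute))
    # (['a', 'b'], ['c'])
```

If something like this is happening, the lost events would be the ones with the oldest event times relative to the rest of their batch, and a longer watermark delay should reduce the loss. If events still go missing with a generous delay, that points elsewhere and the support ticket is the right route.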