a week ago
Hi Team,
I have the following scenario:
I have a Spark Streaming job with a processing-time trigger of 3 seconds, running continuously 365 days a year.
We run a weekly delete job against the source of this streaming job, based on a custom retention policy; it is a DELETE command on the (external) Delta table.
If I set skipChangeCommits to true in my readStream, will I have data loss in my streaming job?
My source is a Bronze Delta Lake external table, loaded in append mode only.
The reason I want to make sure: this option skips an entire commit, so I want to know whether my weekly delete and an insert to my source data might fall under the same commit, causing the option to skip the whole commit and lose data.
Please review the scenario and let me know if there is any potential for data loss with this option.
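For context, here is roughly what the read side looks like; the paths below are placeholders, not my actual setup:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Streaming read from the append-only Bronze Delta table.
# skipChangeCommits makes the reader ignore commits that change
# or remove existing files (e.g., the weekly DELETE).
bronze_stream = (
    spark.readStream
    .format("delta")
    .option("skipChangeCommits", "true")
    .load("/mnt/bronze/events")  # placeholder path
)

# 3-second processing-time trigger, running continuously.
query = (
    bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")  # placeholder
    .trigger(processingTime="3 seconds")
    .start("/mnt/silver/events")  # placeholder sink
)
```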
a week ago
The short answer is no: implementing skipChangeCommits will not cause data loss of new inserts in your streaming job, assuming your source table's operations are transactional (as they are for a Delta table).
If your source was a table that included regular UPDATE or MERGE operations that you did need to capture, then using skipChangeCommits=true would cause data loss of those updated/merged records. Since your source is an append-only Bronze table, this should not be a concern for you.
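To make that caveat concrete, here is a hypothetical sequence (table name is made up; assumes an existing SparkSession as `spark`):

```python
# An INSERT adds new data files; a stream with skipChangeCommits
# still reads this commit as usual.
spark.sql("INSERT INTO bronze_events VALUES (1, 'ok')")

# An UPDATE rewrites existing files, so the whole commit is a change
# commit; a stream with skipChangeCommits never sees these rows.
# That is only acceptable because your Bronze table is append-only.
spark.sql("UPDATE bronze_events SET status = 'fixed' WHERE id = 1")
```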
a week ago
It shouldn't. You have an append-only stream, and skipChangeCommits will ignore any modifications that were applied to already-existing files.
Wednesday
Hi Szymon/Raman,
My question was about the commits: the insert/append performed via my streaming job versus the delete performed by the weekly maintenance job. Is there any way both transactions could fall into the same commit? I need to understand that part to get a clear picture of possible data loss with skipChangeCommits.
Monday
The short answer is no: independent operations from different jobs become separate, serialized commits in the Delta transaction log. They won’t be coalesced into one commit unless you explicitly run a single statement that performs both (for example, a MERGE/OVERWRITE that rewrites files and inserts rows).
One practical note: even if concurrent operations overlap in time, they are still serialized as distinct commits, and streaming reads will see them as separate versions, in order.
This blog post does a great job of explaining the delta transaction log: https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
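If you want to verify this on your own table, the transaction history shows each operation as its own version. A minimal sketch using the DeltaTable API (the table path is a placeholder): you should see your streaming appends and the weekly retention DELETE as separate commits.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Each commit is a separate version with its own operation name,
# so streaming appends and the weekly DELETE show up as distinct rows.
history = DeltaTable.forPath(spark, "/mnt/bronze/events").history()
history.select("version", "timestamp", "operation").show(truncate=False)
```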