Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

SkipChangeCommit to True Scenario on Data Loss Possibility

Naveenkumar1811
New Contributor II

Hi Team,

I have the below scenario:

I have a Spark Streaming job with a processing-time trigger of 3 seconds, running continuously 365 days a year.

We run a weekly delete job against the source of this streaming job, based on a custom retention policy. It is a DELETE command on the (external) Delta table.

If I set skipChangeCommits to true in my readStream, will I have data loss in my streaming job?

My source is a Bronze Delta Lake external table, loaded in append mode only.

The reason I want to be sure: the option skips an entire commit, so I want to know whether my weekly delete and an insert to my source data could fall under the same commit, in which case the option would skip that whole commit and cause data loss.

Please review the scenario and let me know if there is a potential for data loss with this option.
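For reference, the setup looks roughly like this (a minimal sketch; table names and the checkpoint path are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the append-only Bronze Delta table as a stream; skipChangeCommits
    # tells the source to skip commits that modify existing files, such as
    # the weekly retention DELETE.
    stream = (
        spark.readStream
        .option("skipChangeCommits", "true")
        .table("bronze.events")
    )

    # Continuous job with a 3-second processing-time trigger.
    query = (
        stream.writeStream
        .option("checkpointLocation", "/checkpoints/bronze_events")
        .trigger(processingTime="3 seconds")
        .toTable("silver.events")
    )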

1 ACCEPTED SOLUTION

Accepted Solutions

The short answer is no: independent operations from different jobs become separate, serialized commits in the Delta transaction log. They wonโ€™t be coalesced into one commit unless you explicitly run a single statement that performs both (for example, a MERGE/OVERWRITE that rewrites files and inserts rows).

Some practical guidelines:

  • Keep ingestion appends and retention deletes as separate statements/jobs, so they become separate commits and skipChangeCommits only skips the delete commit (see the sketch below).
  • Avoid a MERGE/OVERWRITE that mixes rewrites and inserts in the source Bronze table. If you must, expect the commit to be skipped entirely by skipChangeCommits.
  • If concurrent operations overlap in time, they are still serialized as distinct commits. Streaming reads will see them as separate versions, in order.

This blog post does a great job of explaining the delta transaction log: https://www.databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
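As a sketch under the same assumptions (table name and retention window are illustrative), the weekly delete run as its own statement lands in its own commit, which you can verify in the table history:

    # Weekly retention job: its own statement, therefore its own commit
    # in the Delta transaction log.
    spark.sql("""
        DELETE FROM bronze.events
        WHERE event_ts < current_timestamp() - INTERVAL 30 DAYS
    """)

    # Streaming appends and the DELETE appear as distinct versions with
    # distinct operation names; skipChangeCommits skips only the DELETE one.
    spark.sql("DESCRIBE HISTORY bronze.events") \
        .select("version", "operation") \
        .show(truncate=False)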

 

 


4 REPLIES

Raman_Unifeye
Contributor III

The short answer is no: implementing skipChangeCommits will not cause data loss in your streaming job from new inserts, assuming your source table operations are transactional (as they are for a Delta table).

If your source were a table with regular UPDATE or MERGE operations that you needed to capture, then setting skipChangeCommits=true would cause data loss of those updated/merged records. Since your source is an append-only Bronze table, this should not be a concern for you.
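To make that concrete, a hypothetical illustration (it does not apply to an append-only table like yours):

    # Hypothetical: suppose the source table received updates you needed.
    spark.sql("UPDATE bronze.events SET status = 'corrected' WHERE id = 42")

    # The UPDATE rewrites existing data files, so it produces a "change
    # commit". A stream reading with skipChangeCommits=true skips that
    # commit entirely and never emits the corrected row; ordinary append
    # commits are still processed as usual.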

 


RG #Driving Business Outcomes with Data Intelligence

szymon_dybczak
Esteemed Contributor III

It shouldn't. You have an append-only stream, and skipChangeCommits will ignore any modifications that were applied to already existing files.
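For completeness, a sketch (illustrative table name) of the default behavior versus the option; without it, a Delta stream that hits a change commit fails rather than silently dropping data:

    # Default: the micro-batch that reaches the weekly DELETE's commit
    # stops the query with an error about detected deletes/updates in the
    # streaming source.
    stream = spark.readStream.table("bronze.events")

    # With the option: that commit is skipped wholesale, and the stream
    # continues with subsequent append-only commits.
    stream = (
        spark.readStream
        .option("skipChangeCommits", "true")
        .table("bronze.events")
    )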


 

Naveenkumar1811
New Contributor II

Hi szymon/Raman,

My question was about the commits performed by the insert/append from my streaming job versus the delete operation from the weekly maintenance job. Is there any way both transactions could fall into the same commit? I need to understand that part to get a clear picture of the data-loss risk with skipChangeCommits.

