Databricks Community

6502 · 05-02-2024

I deleted some records from a streaming table by mistake, and of course the streaming job stopped working.

So I restored the table to the version before the delete and attempted to restart the job with startingVersion set to the new version. I did not delete the checkpoint on the first attempt, and the job failed again. On the second attempt I deleted the checkpoint, and the job still did not start; somehow the code was still detecting the deleted rows. Can someone explain why this happened?

Deleting the checkpoint and not passing the startingVersion works, of course. But I see that the checkpoint file reports:

{"sourceVersion":1,"reservoirId":"963e2797-2f22-449a-91c6-c3e3972e4ea5","reservoirVersion":1254,"index":8,"isStartingVersion":true}

Why does it report isStartingVersion as true? Did it pick up the startingVersion I passed? If so, why did the job not start when startingVersion was provided?

raphaelblg · 05-02-2024

Hello @6502,

It appears you've used the `startingVersion` parameter in your streaming query, which makes the stream begin processing from a version before the DELETE operation. The DELETE commit is still processed in order, however, which can cause the stream to fail.

To resolve this issue, consider the following options:

1. Roll back your table version to the version before the DELETE operation using time travel.

(https://docs.databricks.com/en/delta/history.html#restore-a-delta-table-to-an-earlier-state)

or

2. Add the `ignoreDeletes` or `skipChangeCommits` parameter to your query. You can find more information on this in the Databricks documentation.

(https://docs.databricks.com/en/structured-streaming/delta-lake.html#ignore-updates-and-deletes)
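As a rough illustration of the two options above, here is a minimal Python sketch. The helper names (`restore_statement`, `delta_stream_options`, `open_stream`), the table name `events`, and the version numbers are hypothetical and chosen only for the example. It assumes a Delta table registered in the metastore and a runtime that supports `skipChangeCommits`; as far as I know, `ignoreDeletes` only covers deletes at partition boundaries, so check the linked documentation for which option fits your case.

```python
from typing import Any, Dict, Optional


def restore_statement(table_name: str, version: int) -> str:
    """Option 1: build the SQL that rolls the table back to an earlier version."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"RESTORE TABLE {table_name} TO VERSION AS OF {version}"


def delta_stream_options(
    skip_change_commits: bool = True,
    starting_version: Optional[int] = None,
) -> Dict[str, str]:
    """Option 2: build reader options so delete commits do not fail the stream."""
    options: Dict[str, str] = {}
    if skip_change_commits:
        # Skip commits that delete or modify existing records.
        options["skipChangeCommits"] = "true"
    else:
        # Narrower alternative: ignore deletes only.
        options["ignoreDeletes"] = "true"
    if starting_version is not None:
        options["startingVersion"] = str(starting_version)
    return options


def open_stream(spark: Any, table_name: str, options: Dict[str, str]):
    """Open the Delta table as a streaming source (requires a live SparkSession)."""
    return spark.readStream.format("delta").options(**options).table(table_name)
```

In a notebook this might be used as `spark.sql(restore_statement("events", 1253))` for the first option, or `open_stream(spark, "events", delta_stream_options())` for the second, with the table name and version replaced by your own.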

Should you have any questions or concerns, please don't hesitate to respond to this message. I'm here to help!

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

Delete on streaming table and starting startingVersion
