
Streaming problems after VACUUM

pgruetter
Contributor

Hi all

To read from a large Delta table, I'm using readStream, but with trigger(availableNow=True) as I only want to run it daily. This worked well for an initial load and then for the incremental loads after that.
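
For reference, the job looks roughly like this; the table names and checkpoint path here are placeholders, not my real ones:

    # Daily incremental copy of a large Delta table.
    # trigger(availableNow=True) drains the current backlog, then stops.
    (
        spark.readStream
        .table("source_db.big_table")                    # placeholder source
        .writeStream
        .option("checkpointLocation", "/chk/big_table")  # placeholder path
        .trigger(availableNow=True)
        .toTable("target_db.big_table_copy")             # placeholder target
    )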

At some point, though, I received an error from the source Delta table saying that a parquet file referenced by the transaction log is no longer available.

I know that a VACUUM command is periodically issued against the source table, with the default retention of 7 days.
My incremental load was not executed for 2 weeks. Could that be the problem?

How exactly does readStream work here? If it last ran 2 weeks ago, will it try to read all table versions since then? That would explain the error, since it would reference parquet files that are more than 7 days old.

Thanks

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @pgruetter, certainly! Let’s delve into the behavior of readStream in the context of Delta tables and address your questions.

 

Delta Table Streaming with readStream:

  • When you use readStream to read from a Delta table, it operates incrementally.
  • As new data is committed to the source table, the stream processes it idempotently, tracking the table versions it has already consumed in its checkpoint.

Incremental Processing:

  • If your streaming query last ran 2 weeks ago, it will not reprocess the entire table from scratch.
  • Instead, it will resume from the offset stored in its checkpoint and process only the new records committed during that time period.
  • This incremental approach keeps processing efficient; see the sketch below for how commits that are not plain appends are treated.
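
As a rough sketch (the table name is a placeholder, and option availability depends on your Delta Lake / Databricks Runtime version), this is how a stream can be told to tolerate commits that update or delete existing rows:

    # By default a Delta streaming source expects append-only commits and
    # fails when a commit modifies or deletes existing rows.
    stream_df = (
        spark.readStream
        .option("ignoreDeletes", "true")      # tolerate partition-boundary deletes
        .option("skipChangeCommits", "true")  # skip commits that rewrite existing rows
        .table("source_db.big_table")         # placeholder name
    )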

VACUUM Command and File Retention:

  • The VACUUM command periodically cleans up old files in the Delta table.
  • By default, it retains data for 7 days.
  • If your incremental load was not executed for 2 weeks, it’s possible that files your stream still needed were removed by VACUUM; lengthening the retention, as sketched below, avoids this.
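
If the stream can lag behind its source, one mitigation is to lengthen the source table’s retention so that it covers the longest expected gap. A minimal sketch, assuming a 2-week lag plus headroom (the table name is a placeholder):

    # Keep deleted data files and log history for 30 days instead of 7.
    spark.sql("""
        ALTER TABLE source_db.big_table SET TBLPROPERTIES (
            'delta.deletedFileRetentionDuration' = 'interval 30 days',
            'delta.logRetentionDuration'         = 'interval 30 days'
        )
    """)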

Error Related to Missing Parquet Files:

  • The error you encountered about a missing parquet file is most likely caused by this retention policy.
  • If a data file recorded in the transaction log has already been removed by VACUUM, the stream fails as soon as it tries to read it.

Handling Schema Changes:

  • If the schema of the Delta table changes while a streaming read is active, the query may fail.
  • For most schema changes, you can restart the stream to resolve the mismatch and continue processing; a sketch of schema tracking follows below.
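
Newer Delta Lake / Databricks Runtime versions can also record the source schema alongside the checkpoint so the stream survives certain schema changes across restarts. A sketch only, assuming the option is available in your runtime (paths and names are placeholders):

    # Track the evolving source schema in a directory under the checkpoint.
    stream_df = (
        spark.readStream
        .option("schemaTrackingLocation", "/chk/big_table/_schema")  # placeholder path
        .table("source_db.big_table")                                # placeholder name
    )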

Considerations:

  • Ensure that your streaming queries are scheduled appropriately (e.g., daily) to avoid gaps.
  • Monitor the retention duration and adjust it if needed.
  • If you encounter data loss due to VACUUM, consider setting the failOnDataLoss option to false so the stream continues despite the lost data (a sketch follows below).
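
For the last point, a minimal sketch; treat it as a last resort, because rows in the vacuumed files are skipped rather than recovered (the table name is a placeholder):

    # Continue past data files that VACUUM has already deleted.
    stream_df = (
        spark.readStream
        .option("failOnDataLoss", "false")  # skip missing files instead of failing
        .table("source_db.big_table")       # placeholder name
    )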

Remember that Delta Lake provides robust streaming capabilities, but understanding its behaviour and configuring your pipeline accordingly is crucial. Feel free to fine-tune your setup based on your specific requirements.

pgruetter
Contributor

Thanks a lot for the details. One point I still don't get is the difference between these two bullets (let's leave VACUUM aside for this):

  • If your streaming query last ran 2 weeks ago, it will not reprocess the entire table from scratch.
  • Instead, it will resume from the offset stored in its checkpoint and process only the new records committed during that time period.

Let's say my source Delta table is at version 2500. I execute the streaming job once with availableNow=True, so it loads everything up to table version 2500.

Now for two weeks I insert, delete and update data in this source table. After 2 weeks, I'm at version 2750. Now I execute the streaming job again.

I don't understand the difference: isn't everything between versions 2500 and 2750 exactly what has changed? Or does the second bullet point mean it only processes inserts, but not deletes and updates?

Thanks for clarifying.
