Streaming problems after Vaccum
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-13-2023 03:50 AM
Hi all
To read from a large Delta table, I'm using readStream but with a trigger(availableNow=True) as I only want to run it daily. This worked well for an intial load and then incremental loads after that.
At some point though, I received an error from the source Delta table that a parquet file referenced by the index is not available anymore.
I know that a VACUUM command is periodically issued against the source table but with the default of 7 days.
My incremental load was not executed for 2 weeks. Could that be a problem?
How does readStream work exactly: If it ran 2 weeks ago, will it try to read all table versions since then? That could explain the error as it would reference parquet files from > 7 days.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-24-2023 11:14 AM
Thanks a lot for the details. One point I still don't get is the difference between these two points (and let's forget vacuum for this):
- If your streaming query ran 2 weeks ago, it will not reprocess all table versions since then.
- Instead, it will process only the new records introduced during that time period.
Let's say my source delta table version is 2500. I execute a streaming job once with availableNow=True. So it loads everything up to table version 2500.
Now for two weeks I insert, delete and update data in this source table. After 2 weeks, I'm at version 2750. Now I execute the streaming job again.
I don't understand the difference: everything between versions 2500 and 2750 is exactly what has changed? Does the second bullet point mean, it only processes inserts but not deletes and updates?
Thanks for clearifying.