Streaming problems after Vaccum

pgruetter
Contributor

Hi all

To read from a large Delta table, I'm using readStream but with a trigger(availableNow=True) as I only want to run it daily. This worked well for an intial load and then incremental loads after that.

At some point though, I received an error from the source Delta table that a parquet file referenced by the index is not available anymore. 

I know that a VACUUM command is periodically issued against the source table but with the default of 7 days.
My incremental load was not executed for 2 weeks. Could that be a problem?

How does readStream work exactly: If it ran 2 weeks ago, will it try to read all table versions since then? That could explain the error as it would reference parquet files from > 7 days. 

Thanks

Thanks a lot for the details. One point I still don't get is the difference between these two points (and let's forget vacuum for this):

  • If your streaming query ran 2 weeks ago, it will not reprocess all table versions since then.
  • Instead, it will process only the new records introduced during that time period.

Let's say my source delta table version is 2500. I execute a streaming job once with availableNow=True. So it loads everything up to table version 2500. 

Now for two weeks I insert, delete and update data in this source table. After 2 weeks, I'm at version 2750. Now I execute the streaming job again.

I don't understand the difference: everything between versions 2500 and 2750 is exactly what has changed? Does the second bullet point mean, it only processes inserts but not deletes and updates?

Thanks for clearifying.