Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Streaming problems after VACUUM

pgruetter
Contributor

Hi all

To read from a large Delta table, I'm using readStream, but with trigger(availableNow=True) since I only want to run it daily. This worked well for the initial load and for the incremental loads after that.
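For reference, the job looks roughly like this (paths are placeholders; this assumes a Spark session with Delta Lake support, e.g. on Databricks):

```python
# Sketch of a daily incremental copy from a Delta source using
# trigger(availableNow=True): process everything pending, then stop.
# Paths are placeholders; `spark` is the ambient Spark session.
(spark.readStream
    .format("delta")
    .load("/mnt/source/my_delta_table")                        # hypothetical source path
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/my_job")   # remembers the last processed version
    .trigger(availableNow=True)                                # drain all available data, then shut down
    .start("/mnt/target/my_delta_table"))                      # hypothetical target path
```

The checkpoint location is what makes the daily runs incremental: each run resumes from the table version recorded there.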

At some point, though, I received an error from the source Delta table saying that a parquet file referenced by the transaction log is no longer available.

I know that a VACUUM command is periodically run against the source table, but with the default retention of 7 days.
My incremental load was not executed for 2 weeks. Could that be the problem?
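If you control the source table, one possible mitigation (hedged: the table name below is a placeholder, and you should size the interval to your longest expected gap between runs) is to extend how long VACUUM keeps removed data files via the delta.deletedFileRetentionDuration table property:

```python
# Extend the deleted-file retention window so that a stream paused for
# up to ~4 weeks can still replay old commits. Table name is a placeholder.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.my_source_table
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days')
""")
```

The trade-off is storage: removed files are kept around for the full interval before VACUUM may delete them.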

How exactly does readStream work: if the last run was 2 weeks ago, will it try to read all table versions since then? That could explain the error, as it would reference parquet files older than 7 days.
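To illustrate the suspicion, here is a toy model in plain Python (purely illustrative, not Delta internals) of how a checkpointed stream that replays old commits can hit files that VACUUM has already deleted:

```python
# Toy model: the checkpoint stores the last processed table version; on
# restart the stream replays every later commit, and each commit references
# concrete parquet files. VACUUM deletes files that are absent from the
# current snapshot AND older than the retention window, so old commits may
# point at files that no longer exist. All names are illustrative.

last_checkpointed_version = 2500

# Files added per commit while the stream was paused (version -> files).
commits = {
    2501: ["part-A"],   # written 14 days ago, later rewritten by an update
    2650: ["part-B"],   # written 5 days ago, still in the current snapshot
}
file_age_days = {"part-A": 14, "part-B": 5}
current_snapshot = {"part-B"}       # part-A was replaced by a later update

# VACUUM with the default 7-day retention:
retention_days = 7
deleted = {f for f in file_age_days
           if f not in current_snapshot and file_age_days[f] > retention_days}

# The resumed stream replays commits > 2500 and tries to open their files.
missing = [f for v, files in sorted(commits.items())
           if v > last_checkpointed_version
           for f in files if f in deleted]

print(missing)   # ['part-A'] -> the "parquet file not available" error
```

In this model the stream never needed part-A's *current* contents; it needed the historical file that an old commit added, and that file fell out of the retention window during the 2-week pause.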

Thanks


Thanks a lot for the details. One point I still don't get is the difference between these two points (and let's forget vacuum for this):

  • If your streaming query ran 2 weeks ago, it will not reprocess all table versions since then.
  • Instead, it will process only the new records introduced during that time period.

Let's say my source Delta table is at version 2500. I execute the streaming job once with availableNow=True, so it loads everything up to version 2500.

Now for two weeks I insert, delete and update data in this source table. After 2 weeks, I'm at version 2750. Now I execute the streaming job again.

I don't understand the difference: isn't everything between versions 2500 and 2750 exactly what has changed? Or does the second bullet point mean it only processes inserts, but not deletes and updates?
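One way to picture the difference, as a toy model in plain Python (illustrative only, not Delta internals): the streaming source emits the data files *added* by each commit rather than computing a row-level diff between snapshots, and by default it raises an error on any commit that also removed files (i.e. an UPDATE/DELETE/MERGE rewrote data) unless an option such as ignoreChanges or skipChangeCommits is set:

```python
# Toy model of a Delta streaming source consuming commits 2501..2750.
# Append-only commits stream through as new rows; a commit with removed
# files signals updated/deleted data and fails the stream by default.
def process_commits(commits, ignore_changes=False):
    emitted = []
    for c in commits:
        if c["removed_files"] and not ignore_changes:
            raise RuntimeError(
                f"version {c['version']}: data changed (update/delete); "
                "set ignoreChanges or skipChangeCommits")
        # With ignoreChanges, rewritten files are emitted again, so updated
        # rows reappear as if inserted (downstream may see duplicates).
        emitted.extend(c["added_files"])
    return emitted

commits = [
    {"version": 2501, "added_files": ["part-A"], "removed_files": []},          # INSERT
    {"version": 2600, "added_files": ["part-C"], "removed_files": ["part-A"]},  # UPDATE
]

try:
    process_commits(commits)
except RuntimeError as e:
    print(e)                                             # default: the stream fails

print(process_commits(commits, ignore_changes=True))     # ['part-A', 'part-C']
```

So "only the new records" means the appended files per commit: inserts flow through naturally, while deletes and updates either break the stream or, depending on the option chosen, are skipped or re-emitted rather than applied as a diff.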

Thanks for clarifying.
