Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What is the Best Practice of Maintaining the Delta table loaded in Streaming?

Naveenkumar1811
New Contributor II

Hi Team,

We have our Bronze (append), Silver (append), and Gold (merge) tables loaded using Spark Structured Streaming running continuously with a processing-time trigger of 3 seconds.

We also run maintenance jobs on the tables, such as OPTIMIZE and VACUUM, and we perform DELETE on some tables based on a datetime retention policy.

In such cases we see that our jobs often fail, stating that an underlying source file was deleted, is missing, or was updated.

I want to understand the optimal design or approach for performing this kind of maintenance without affecting my streaming process.

 

Thanks,

Naveen

 

2 REPLIES

mark_ott
Databricks Employee

To ensure continuous Spark streaming to your Delta Lake Bronze, Silver, and Gold tables while performing maintenance jobs (like OPTIMIZE, VACUUM, and DELETE) without streaming job failures, you need to coordinate and optimize your maintenance strategies. Failures such as "underlying source file deleted or missing" are common when streaming jobs encounter files that maintenance jobs have removed or altered.

Increase Streaming Trigger Interval

  • Increase the batch trigger interval for streaming jobs (from 3 seconds to something larger, like 2 minutes or more), especially on high-throughput tables. This reduces contention between streaming write/read operations and ongoing maintenance tasks.
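
For example, a minimal sketch of widening the trigger on a streaming Delta write (the source DataFrame, target table, and checkpoint path are placeholders):

  # Widen the processing-time trigger from 3 seconds to something larger.
  # bronze_df, the target table, and the checkpoint path are placeholders.
  (bronze_df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/mnt/checkpoints/silver")
      .trigger(processingTime="2 minutes")   # was processingTime="3 seconds"
      .toTable("silver"))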

Avoid Overlap Between Streaming and Maintenance

  • Do not schedule maintenance jobs like DELETE, OPTIMIZE, or VACUUM at the same time as streaming workloads on the same table. Maintenance tasks should run during planned low-traffic windows when streaming is paused, or be orchestrated to avoid overlap with streaming activity.

  • Maintenance should be orchestrated using workflows (job orchestration tools or scripts) so that streaming jobs are paused, maintenance completes, and then streaming resumes, as sketched below.
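
A minimal pause/maintain/resume sketch of that ordering, with placeholder table and checkpoint names (an illustration of the idea, not your exact pipeline):

  # Placeholder names throughout; the point is the ordering: stop, maintain, restart.
  def start_gold_stream():
      return (spark.readStream.table("silver")
          .writeStream
          .option("checkpointLocation", "/mnt/checkpoints/gold")
          .trigger(processingTime="3 seconds")
          .toTable("gold"))

  query = start_gold_stream()

  # ... later, inside a planned maintenance window ...
  query.stop()                               # stop the stream before touching its tables

  spark.sql("OPTIMIZE gold")
  spark.sql("VACUUM gold RETAIN 168 HOURS")  # keep the default 7-day retention

  query = start_gold_stream()                # resume from the same checkpoint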

Retention and Vacuum Settings

  • Do not set a VACUUM retention period lower than the time window within which streaming jobs may still read old files. The default is 7 days; setting it lower (e.g., 1 hour) risks removing files still in use by streaming queries, leading to failures.

  • If retention must be lowered, set spark.databricks.delta.retentionDurationCheck.enabled to false, but this is generally discouraged unless you're confident no consumer will read recently deleted files.
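
For illustration, with a placeholder table name:

  # Keep the deleted-file retention at least as long as the slowest consumer's
  # read lag (7 days shown here, which is also the default).
  spark.sql("""
      ALTER TABLE silver
      SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 7 days')
  """)

  # Only if you really must vacuum below the safety threshold (generally discouraged):
  spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
  spark.sql("VACUUM silver RETAIN 24 HOURS")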

Use AutoOptimize and AutoCompaction

  • Enable Delta Lake features like autoOptimize and autoCompaction to minimize small files and keep tables performant, which improves both streaming and batch consumption and reduces maintenance frequency.
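
For example, on an existing table (the table name is a placeholder):

  # Enable optimized writes and automatic compaction on the table itself,
  # so every writer (including the streaming job) benefits.
  spark.sql("""
      ALTER TABLE silver SET TBLPROPERTIES (
          'delta.autoOptimize.optimizeWrite' = 'true',
          'delta.autoOptimize.autoCompact'   = 'true'
      )
  """)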

Cluster and Job Management

  • Use separate job clusters for maintenance work (OPTIMIZE, VACUUM, DELETE), ideally autoscaling to fit resource needs, so streaming jobs remain unaffected by resource contention.

  • If required, use larger driver nodes and more workers to parallelize heavy maintenance jobs, then scale down after the job completes for efficiency.
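
As a rough, hypothetical sketch of such a setup, a dedicated maintenance job on its own autoscaling job cluster could be defined with a Jobs API payload along these lines (node type, worker counts, schedule, and notebook path are assumptions, not recommendations):

  # Hypothetical Jobs API 2.1 payload: maintenance runs on its own autoscaling
  # job cluster, separate from the always-on streaming cluster.
  maintenance_job = {
      "name": "delta-table-maintenance",
      "tasks": [{
          "task_key": "optimize_and_vacuum",
          "notebook_task": {"notebook_path": "/Repos/ops/delta_maintenance"},
          "new_cluster": {
              "spark_version": "14.3.x-scala2.12",
              "node_type_id": "i3.xlarge",
              "autoscale": {"min_workers": 2, "max_workers": 8},
          },
      }],
      "schedule": {
          "quartz_cron_expression": "0 0 2 * * ?",  # daily at 02:00
          "timezone_id": "UTC",
      },
  }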

Streaming-Specific Options

  • In streaming jobs, use foreachBatch to run periodic maintenance operations safely at batch boundaries rather than concurrently with the stream (see the sketch after this list).

  • Always checkpoint streaming queries. If the schema changes, or if VACUUM deletes files that are still referenced, restarting the query with a new checkpoint may be necessary.
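
A minimal foreachBatch sketch, assuming a placeholder silver source, a gold target keyed on an id column, and an arbitrary every-N-batches cadence; adapt the names and interval to your pipeline:

  # Maintenance runs only between micro-batches, never concurrently with them.
  OPTIMIZE_EVERY_N_BATCHES = 1000   # placeholder cadence

  def upsert_and_maintain(batch_df, batch_id):
      # Normal per-batch work, e.g. the MERGE into the gold table.
      batch_df.createOrReplaceTempView("updates")
      batch_df.sparkSession.sql("""
          MERGE INTO gold t
          USING updates s ON t.id = s.id
          WHEN MATCHED THEN UPDATE SET *
          WHEN NOT MATCHED THEN INSERT *
      """)
      # Periodic maintenance at a batch boundary; note that the next batch waits
      # for as long as OPTIMIZE takes to run.
      if batch_id % OPTIMIZE_EVERY_N_BATCHES == 0:
          batch_df.sparkSession.sql("OPTIMIZE gold")

  (spark.readStream.table("silver")
      .writeStream
      .foreachBatch(upsert_and_maintain)
      .option("checkpointLocation", "/mnt/checkpoints/gold")   # always checkpoint
      .trigger(processingTime="3 seconds")
      .start())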

Best Practice Summary

  • Schedule and orchestrate maintenance and streaming to run in non-overlapping windows.

  • Never run maintenance with retention periods shorter than the possible read lag of any streaming consumer.

  • Increase streaming batch triggers, use autoOptimize/autoCompaction, and monitor for small file accumulation.

  • Automate maintenance via orchestration tools/scripts to avoid manual errors.

  • Consider using foreachBatch in streaming jobs to embed some maintenance logic with awareness of streaming state.

By applying these best practices, you reduce maintenance-related stream failures and keep your pipelines resilient and performant.

Naveenkumar1811
New Contributor II

Hi Mark,

But the real problem is that our streaming job runs 24x7, 365 days a year, and we can't afford any further latency in the data flowing to the Gold layer. We don't have any window to pause or slow down our streaming, and we receive the data feed continuously, since the data is flowing from a satellite terminal. The suggestions mentioned above are mostly about controlling our streaming, which is not possible in our case. Is there any other best design approach from a live-application perspective?