Re: Reduce the Time for First Spark Streaming Run ...

stbjelcevic · 11-25-2025

You can make the Gold stream start much faster by skipping the full initial snapshot and bootstrapping from the point Gold has already processed, then adding rate limits and a few options suited to heavy Delta tables.

When you stream directly from a Delta table, the engine first processes the entire current table state as an initial snapshot before moving to incremental changes, which is expensive on a “heavy” Silver table.

If your stream is stateful and uses a watermark, the initial snapshot is processed by default in file last-modification-time order rather than event-time order. That can cause extra scanning, and records can even be dropped as late data.
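If you do have to process the initial snapshot with a watermark, here is a minimal sketch of the source options that might help. It assumes the Delta source option withEventTimeOrder is available on your runtime and uses maxFilesPerTrigger as the rate limit. The helper name snapshot_options is made up for illustration and is not part of any API.

```python
from typing import Dict, Optional


def snapshot_options(max_files_per_trigger: Optional[int] = None) -> Dict[str, str]:
    """Build Delta streaming-source options for a watermarked initial snapshot.

    Hypothetical helper: only the option keys are Delta source options,
    the function itself is an illustration.
    """
    # Ask the source to process the initial snapshot in event-time order
    # so the watermark does not treat older records as late data.
    opts = {"withEventTimeOrder": "true"}
    if max_files_per_trigger is not None:
        # Cap how many files each micro-batch picks up.
        opts["maxFilesPerTrigger"] = str(max_files_per_trigger)
    return opts


# Usage (assumes an existing SparkSession named `spark` and a table
# called `silver` with an `event_time` column; both names are placeholders):
# df = (spark.readStream.format("delta")
#       .options(**snapshot_options(max_files_per_trigger=500))
#       .table("silver")
#       .withWatermark("event_time", "10 minutes"))
```

The value 500 is only a placeholder; the right limit depends on your file sizes and cluster.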

Here are a few different things to consider that might help you out:

Stream from Silver’s CDF and start at the current processed version: If your Silver has updates/deletes (not append-only), enable Change Data Feed (CDF) on Silver and read the CDF in Gold, starting from the version your Gold has already processed. This skips the big initial snapshot and only consumes row-level changes.
If Silver is append-only, stream the table itself (skip change commits): For purely append-only sources, you can stream the Delta table directly and set skipChangeCommits so that any non-append commits are ignored when they happen. Databricks recommends skipChangeCommits for new workloads.
Start the stream from “latest” (no historical snapshot): If Gold is already up-to-date and you don’t need the initial snapshot, set the Delta source startingVersion to latest so you only process future changes.
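The three options above can be sketched roughly like this in PySpark. The helper names gold_source_options and read_silver, the mode strings, and the table name are all made up for illustration; only the option keys (readChangeFeed, startingVersion, skipChangeCommits, maxFilesPerTrigger) are Delta source options. Keep in mind that startingVersion is only honored when the query starts with a fresh checkpoint. Once progress is recorded in the checkpoint, it is ignored.

```python
from typing import Dict, Optional


def gold_source_options(
    mode: str,
    starting_version: Optional[int] = None,
    max_files_per_trigger: Optional[int] = None,
) -> Dict[str, str]:
    """Return Delta streaming-source options for one of the three approaches.

    Hypothetical helper. `mode` is one of "cdf", "append_only", "latest".
    """
    if mode == "cdf":
        if starting_version is None:
            raise ValueError("cdf mode needs a starting_version")
        # startingVersion is inclusive, so pass the first Silver version
        # that Gold has NOT processed yet.
        opts = {"readChangeFeed": "true", "startingVersion": str(starting_version)}
    elif mode == "append_only":
        # Ignore commits that update or delete existing rows.
        opts = {"skipChangeCommits": "true"}
    elif mode == "latest":
        # Skip the historical snapshot and pick up only future changes.
        opts = {"startingVersion": "latest"}
    else:
        raise ValueError(f"unknown mode: {mode}")
    if max_files_per_trigger is not None:
        # Optional rate limit per micro-batch.
        opts["maxFilesPerTrigger"] = str(max_files_per_trigger)
    return opts


def read_silver(spark, table_name: str, options: Dict[str, str]):
    """Open a streaming read on a Delta table with the given options.

    `spark` is an existing SparkSession supplied by the caller.
    """
    return spark.readStream.format("delta").options(**options).table(table_name)


# Usage (placeholder table name and version number):
# df = read_silver(spark, "silver", gold_source_options("cdf", starting_version=1234))
```

For the CDF route, Silver needs delta.enableChangeDataFeed set to true as a table property before the versions you want to read, and Gold has to handle the _change_type column in the feed.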

Here are some good resources to take a look at as well: