stbjelcevic
Databricks Employee

You can make the Gold stream start much faster by skipping the full initial snapshot and bootstrapping from the point Gold has already processed. Rate limits and a few Delta source options suited to heavy tables help as well.

When you stream directly from a Delta table, the engine first processes the entire current table state as an initial snapshot before moving to incremental changes, which is expensive on a “heavy” Silver table.

If your stream is stateful and uses a watermark, note that the initial snapshot is processed in file order by last modification time by default. Without event-time ordering, this can cause extra scanning, and some records can even be dropped as late data.
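To illustrate the event-time ordering point, here is a minimal sketch that turns on the Delta source option withEventTimeOrder and applies a watermark. The function name, table name, column name, and delay are placeholders rather than anything from your pipeline, and the option only affects how the initial snapshot is processed.

```python
def read_with_event_time_order(spark, table_name, event_time_col, delay):
    """Stream a Delta table, processing the initial snapshot in event-time order.

    `spark` is an existing SparkSession; the other arguments are placeholders.
    """
    return (
        spark.readStream
        .format("delta")
        # Split the initial snapshot into event-time buckets instead of
        # reading files by last modification time.
        .option("withEventTimeOrder", "true")
        .table(table_name)
        # Event-time ordering relies on a watermark over the event-time column.
        .withWatermark(event_time_col, delay)
    )
```

In a notebook this might be called as read_with_event_time_order(spark, "silver.events", "event_ts", "10 minutes"), where the table and column names are hypothetical.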

Here are a few options to consider:

  • Stream from Silver’s CDF and start at the current processed version: If your Silver has updates/deletes (not append-only), enable Change Data Feed (CDF) on Silver and read the CDF in Gold, starting from the version your Gold has already processed. This skips the big initial snapshot and only consumes row-level changes.
  • If Silver is append-only, stream the table itself (skip change commits): For purely append-only sources, you can stream the Delta table directly and set skipChangeCommits so that any commits that update or delete existing rows are ignored instead of failing the stream. Databricks recommends skipChangeCommits for new workloads.
  • Start the stream from “latest” (no historical snapshot): If Gold is already up-to-date and you don’t need the initial snapshot, set startingVersion to latest on the Delta source so you only process future changes.
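The three approaches above differ mainly in the options passed to the Delta streaming source. Here is a rough sketch of how they could be wired up. The helper names and the mode labels are made up for illustration, and the starting version is something you would have to work out from your own Gold pipeline. Keep in mind that startingVersion is inclusive and only applies when a query starts with a fresh checkpoint. The optional maxFilesPerTrigger setting is one of the rate limits mentioned earlier.

```python
def gold_source_options(mode, starting_version=None, max_files_per_trigger=None):
    """Build Delta source options for one of the three approaches (hypothetical helper)."""
    if mode == "cdf":
        # Read row-level changes from Change Data Feed, from a known version onward.
        if starting_version is None:
            raise ValueError("mode 'cdf' needs a starting_version")
        opts = {"readChangeFeed": "true", "startingVersion": str(starting_version)}
    elif mode == "append_only":
        # Stream the table itself and ignore commits that update or delete rows.
        opts = {"skipChangeCommits": "true"}
    elif mode == "latest":
        # Skip the historical snapshot and pick up only future changes.
        opts = {"startingVersion": "latest"}
    else:
        raise ValueError(f"unknown mode: {mode}")
    if max_files_per_trigger is not None:
        # Rate limit: cap the number of new files per micro-batch.
        opts["maxFilesPerTrigger"] = str(max_files_per_trigger)
    return opts


def read_silver(spark, table_name, **kwargs):
    """Open a streaming read on the Silver table with the chosen options."""
    return (
        spark.readStream
        .format("delta")
        .options(**gold_source_options(**kwargs))
        .table(table_name)
    )
```

For example, read_silver(spark, "silver.events", mode="cdf", starting_version=last_version) would read the change feed from last_version onward, where both the table name and last_version are placeholders.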

Here are some good resources to take a look at as well: