You can make the Gold stream start much faster by avoiding the "full initial snapshot" and bootstrapping from the point your Gold has already processed, plus a few rate limits and features tuned for heavy Delta tables.
When you stream directly from a Delta table, the engine first processes the entire current table state as an initial snapshot before moving to incremental changes, which is expensive on a "heavy" Silver table.
If your stream is stateful and uses a watermark, the default initial-snapshot file order (by last modification time) can also cause extra scanning and, unless event-time ordering is enabled (the `withEventTimeOrder` option), records being dropped as late data.
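To illustrate the event-time ordering point, here is a minimal sketch of a watermarked read that turns on `withEventTimeOrder` and caps the files per micro-batch with `maxFilesPerTrigger`. The helper names, the table name `silver.events`, the column `event_ts`, and the 500-file cap are placeholders rather than anything from your setup, and `spark` is assumed to be an existing SparkSession.

```python
def snapshot_options(with_event_time_order=True, max_files_per_trigger=None):
    """Build Delta source options for a watermarked stream (hypothetical helper)."""
    opts = {}
    if with_event_time_order:
        # Process the initial snapshot in event-time order instead of
        # file modification order.
        opts["withEventTimeOrder"] = "true"
    if max_files_per_trigger is not None:
        # Rate limit: upper bound on files considered per micro-batch.
        opts["maxFilesPerTrigger"] = str(max_files_per_trigger)
    return opts


def read_with_watermark(spark, table_name, event_col, delay, options):
    """Open a Delta streaming read and attach a watermark on event_col."""
    return (
        spark.readStream.format("delta")
        .options(**options)
        .table(table_name)
        .withWatermark(event_col, delay)
    )


# Example usage (assumes a live SparkSession named `spark`):
# df = read_with_watermark(
#     spark, "silver.events", "event_ts", "10 minutes",
#     snapshot_options(max_files_per_trigger=500),
# )
```

This only matters while the initial snapshot is being processed; if you skip the snapshot entirely, the ordering issue does not come up.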
Here are a few different things to consider that might help you out:
- Stream from Silver's CDF and start at the current processed version: If your Silver has updates/deletes (not append-only), enable Change Data Feed (CDF) on Silver and read the CDF in Gold, starting from the version your Gold has already processed; this skips the big initial snapshot and only consumes row-level changes.
- If Silver is append-only, stream the table itself (skip change commits): For purely append-only sources, you can stream the Delta table directly and set options to ignore non-append commits when they happen. Databricks recommends skipChangeCommits for new workloads.
- Start the stream from "latest" (no historical snapshot): If Gold is already up-to-date and you don't need the initial snapshot, set the Delta source starting point to the latest so you only process future changes.
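The three options above map to different Delta source settings. The following is a rough sketch, not a drop-in implementation: `gold_source_options` and `read_silver_stream` are hypothetical helpers, `silver.events` and version `1234` are placeholder values, and `spark` is assumed to be an existing SparkSession.

```python
def gold_source_options(mode, starting_version=None):
    """Return Delta streaming source options for one of the three approaches.

    mode: "cdf", "append_only", or "latest" (names chosen for this sketch).
    starting_version: Delta table version to start from; startingVersion is
    inclusive, so pass the first version Gold has NOT yet processed.
    """
    if mode == "cdf":
        if starting_version is None:
            raise ValueError("cdf mode needs a starting_version")
        # Read row-level changes from the Change Data Feed.
        return {"readChangeFeed": "true", "startingVersion": str(starting_version)}
    if mode == "append_only":
        # Ignore commits that update or delete existing rows.
        opts = {"skipChangeCommits": "true"}
        if starting_version is not None:
            opts["startingVersion"] = str(starting_version)
        return opts
    if mode == "latest":
        # Skip history entirely and process only future commits.
        return {"startingVersion": "latest"}
    raise ValueError(f"unknown mode: {mode!r}")


def read_silver_stream(spark, table_name, options):
    """Open a streaming read on a Delta table with the given options."""
    return spark.readStream.format("delta").options(**options).table(table_name)


# Example usage (assumes a live SparkSession named `spark`):
# changes = read_silver_stream(spark, "silver.events", gold_source_options("cdf", 1234))
```

CDF has to be enabled on Silver first (the `delta.enableChangeDataFeed` table property). Also note that the starting options only take effect for a new query with a fresh checkpoint; an existing checkpoint resumes from where it left off.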
Here are some good resources to take a look at as well: