cancel
Showing results for 
Search instead for 
Did you mean: 
Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
cancel
Showing results for 
Search instead for 
Did you mean: 

Incremental Load Without any keys

Rupa0503
New Contributor II

so am performing incremental load where i want to insert or update data but there are no date columns or any keys how can i do the incremental load from silver to gold layer

4 REPLIES 4

Sumit_7
Esteemed Contributor

@Rupa0503 The correct approach is to have a key identifier to match and merge the records. You could also do a full load (expensive but works). Databricks official recommendation is Delta Change Data Feed.

balajij8
Contributor III

@Rupa0503 

You can follow below

  • Key Hashing & Merge - You can create a synthetic key by concatenating all data columns in a row and applying a hashing algorithm (MD5 or SHA-2) within the Silver layer. The hash acts as a unique fingerprint for that exact snapshot of data. When moving data to Gold, you can use a MERGE operation matching solely on this generated hash. If the hash already exists in Gold, the row can be ignored as a duplicate. If it does not exist, you can treat it as a new entry. If an upstream entity changes, it generates a brand new hash which naturally appends to Gold as a new row version preserving historical states without requiring a traditional update key.
  • Structured Streaming - If your Silver layer is strictly append-only (new data arrives continuously but existing records are never modified or rewritten upstream), you can bypass batch processing entirely in favor of Structured Streaming. Databricks Structured Streaming utilizes internal metadata checkpoints to track exactly which data files have been processed. The engine automatically handles the incremental logic behind the scenes by reading only the new files populated in the Silver table since the last trigger. Because the framework relies on file level tracking rather than data/column level, it successfully pushes incremental updates to Gold without needing a key or timestamp.
  • Change Data Feed (CDF) - If the Silver layer does experience modifications (such as overwrites or merges) and you cannot rely on a simple append stream, you can enable Delta Lake’s Change Data Feed (CDF) on the Silver table. CDF automatically isolates row level mutations by exposing metadata columns (whether a row is a new insertion or the post image of an update). By streaming from the Change Data Feed, you can pull the exact rows that changed during the latest commit minimizing data scanning costs, allowing to feed a set of modified rows into your Gold logic without scanning full data.

pragya17
New Contributor III

@Rupa0503 Watermark doesn't need a date column from silver/gold. You can use pipeline run timestamp as the watermark — tracking when the pipeline last ran, not when data was modified. Also you can use Delta table version history within watermark  as no date column will be needed .

Second , you can use Delta Change Data Feed - best for no date column and row level changes . 

savlahanish27
Databricks Partner

The short answer is yes, but with an important condition.

Delta Lake's versioning and Change Data Feed make keyless incremental loads possible by tracking changes at the storage layer rather than the data layer. Every write to a Delta table increments a version number, and CDF exposes metadata columns - _change_type, _commit_version, and _commit_timestamp - on every row that changed in that commit. You read from the last processed version and get exactly what changed, without needing any business key or date column in your actual data.

Structured Streaming takes this further for append-only silver tables - it tracks which files have already been consumed via a checkpoint directory, so the engine handles the incremental logic entirely at the file level. Again, nothing from the data itself is required.

Where it breaks down is upserts. Merging into gold requires something to match on. Without a key, you cannot reliably identify that a row arriving in silver is an updated version of an existing row in gold - you can only know that something changed, not what it corresponds to. In that case, incremental append is still solvable, but a true upsert leaves you with two real options: synthesise a key by hashing all columns and accept that you are tracking row snapshots rather than entities or fall back to a full reload of the affected partition.

So, the boundary to understand before choosing an approach is this - incremental detection without keys is solved by Delta itself. Incremental merge without keys is a data modelling problem, not a platform problem.