cancel
Showing results for 
Search instead for 
Did you mean: 
Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
cancel
Showing results for 
Search instead for 
Did you mean: 

Incremental Load Without any keys

Rupa0503
New Contributor II

so am performing incremental load where i want to insert or update data but there are no date columns or any keys how can i do the incremental load from silver to gold layer

3 REPLIES 3

Sumit_7
Esteemed Contributor

@Rupa0503 The correct approach is to have a key identifier to match and merge the records. You could also do a full load (expensive but works). Databricks official recommendation is Delta Change Data Feed.

balajij8
Contributor III

@Rupa0503 

You can follow below

  • Key Hashing & Merge - You can create a synthetic key by concatenating all data columns in a row and applying a hashing algorithm (MD5 or SHA-2) within the Silver layer. The hash acts as a unique fingerprint for that exact snapshot of data. When moving data to Gold, you can use a MERGE operation matching solely on this generated hash. If the hash already exists in Gold, the row can be ignored as a duplicate. If it does not exist, you can treat it as a new entry. If an upstream entity changes, it generates a brand new hash which naturally appends to Gold as a new row version preserving historical states without requiring a traditional update key.
  • Structured Streaming - If your Silver layer is strictly append-only (new data arrives continuously but existing records are never modified or rewritten upstream), you can bypass batch processing entirely in favor of Structured Streaming. Databricks Structured Streaming utilizes internal metadata checkpoints to track exactly which data files have been processed. The engine automatically handles the incremental logic behind the scenes by reading only the new files populated in the Silver table since the last trigger. Because the framework relies on file level tracking rather than data/column level, it successfully pushes incremental updates to Gold without needing a key or timestamp.
  • Change Data Feed (CDF) - If the Silver layer does experience modifications (such as overwrites or merges) and you cannot rely on a simple append stream, you can enable Delta Lake’s Change Data Feed (CDF) on the Silver table. CDF automatically isolates row level mutations by exposing metadata columns (whether a row is a new insertion or the post image of an update). By streaming from the Change Data Feed, you can pull the exact rows that changed during the latest commit minimizing data scanning costs, allowing to feed a set of modified rows into your Gold logic without scanning full data.

pragya17
New Contributor II

@Rupa0503 Watermark doesn't need a date column from silver/gold. You can use pipeline run timestamp as the watermark — tracking when the pipeline last ran, not when data was modified. Also you can use Delta table version history within watermark  as no date column will be needed .

Second , you can use Delta Change Data Feed - best for no date column and row level changes .