cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

synapse pyspark delta lake merge scd type2 without primary key

sunil_ksheersag
New Contributor

Problem
I have a set of rows coming from previous process which has no primary key, and the composite keys are bound to change which are not a good case for composite key, only way the rows are unique is the whole row( including all keys and all values). I need to implement the SCD type2 on this data. The environment is Synapse pyspark, using delta lake Merge command and more.

how I tried
Using row hash: In this case the challenge without primary/composite key is to find which rows have changed/updated. With any updated values the row hash is changing and resulting into new row.

please suggest how this problem can be solved. If you have any questions on this, please write back.

1 REPLY 1

Kaniz_Fatma
Community Manager
Community Manager

Hi @sunil_ksheersag, Implementing Slowly Changing Dimension (SCD) Type 2 without a primary key can be challenging, but there are alternative approaches you can consider. 

 

Here are some strategies to handle this situation:

 

Surrogate Key Approach:

  • Add a surrogate key to each record. This key is an artificial identifier that assigns a unique number to each row processed.
  • For every new record, insert it with a new surrogate key.
  • When an existing record changes, create a new row with the updated values and a new surrogate key.
  • This approach allows you to track changes over time without relying on a primary key.
  • You can use this surrogate key as a reference for historical data.

Business Key Approach:

  • Identify business keys within your data. These are natural keys that uniquely identify a record based on business rules.
  • Use these business keys to determine whether a record has changed.
  • When a change occurs, insert a new row with the updated values and a new surrogate key.
  • Maintain the history of changes using the business keys.
  • This approach is often used when primary keys are not available.

Hash-Based Approach:

  • Although you mentioned challenges with row hash due to changing values, you can still use a hash-based approach.
  • Calculate a hash (e.g., SHA-256) of the entire row (including all keys and values).
  • Compare the hash of the incoming row with the hash of the existing row.
  • If the hash differs, create a new row with the updated values.
  • This method ensures that any change in the row content is captured, even without a primary key.
  • However, be cautious about hash collisions and ensure the hash function is robust.

Data-Driven Approach:

  • In your ETL process, use a target load plan where the update flow is applied first, followed by the insert operation.
  • This way, you can identify changes and insert new rows accordingly.
  • Set the session task properties to treat the source row as data-driven.
  • This approach doesn’t rely on a primary key but still allows you to manage historical changes.

Delta Lake Merge Command:

  • Since you’re using Delta Lake, leverage the MERGE command.
  • Compare the incoming data with the existing data based on the entire row content.
  • Use the WHEN MATCHED clause to update existing rows and the WHEN NOT MATCHED clause to insert new rows.
  • The absence of a primary key won’t hinder this process.

Remember that while these approaches allow you to handle SCD Type 2 without a primary key, they may impact performance and require careful design. Choose the one that best aligns with your specific use case and data requirements. 

 

If you have any further questions, feel free to ask! 😊

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group