Databricks Community

Akshay_Petkar · ‎11-15-2024

Can anyone provide a sample MERGE INTO SQL query for implementing SCD Type 2 in Databricks using Delta Tables?

Akshay Petkar

jeffreyaven · ‎10-29-2025

Here is a simple example using an upstream Delta table with Change Data Feed enabled. It uses table_changes() to get the changed records together with their corresponding operation. This is a two-step process:

1. Close out modified or deleted records.
2. Add new rows (records inserted or updated at the source).

-- Step 1: Close out records that changed (updates and deletes)

MERGE INTO west_division.retail_data.customers_type2 AS target
USING (
  -- Keep one row per customer: MERGE fails when several source rows match the
  -- same target row, which can happen if a customer changes more than once
  -- within the version range.
  SELECT customer_id, MIN(_commit_timestamp) AS _commit_timestamp
  FROM table_changes('east_division_shared.retail.customers', 2, 5)
  WHERE _change_type IN ('update_postimage', 'delete')
  GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id AND target.is_current = true
WHEN MATCHED THEN
  UPDATE SET
    end_date = source._commit_timestamp,
    is_current = false;

-- Step 2: Insert new versions (inserts and updates)
INSERT INTO west_division.retail_data.customers_type2
SELECT
  customer_id, customer_name, email, country, signup_date, customer_segment,
  _commit_timestamp as start_date,
  NULL as end_date,
  true as is_current
FROM table_changes('east_division_shared.retail.customers', 2, 5)
WHERE _change_type IN ('insert', 'update_postimage')
ORDER BY _commit_timestamp;
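
To make the two steps above concrete, here is a minimal pure-Python sketch of the same row-level logic on in-memory dicts. It is not Spark or Delta code: the function name apply_change_feed and the sample rows are hypothetical, only one attribute (email) is tracked, and it assumes at most one change per customer in the batch.

```python
from typing import Any


def apply_change_feed(
    dimension: list[dict[str, Any]], changes: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Apply change-feed rows to an SCD Type 2 dimension held as a list of dicts."""
    result = [dict(row) for row in dimension]  # copy so the input is not mutated

    # Step 1: close out the current row of every customer that was updated or deleted.
    closing = {
        c["customer_id"]: c["_commit_timestamp"]
        for c in changes
        if c["_change_type"] in ("update_postimage", "delete")
    }
    for row in result:
        if row["is_current"] and row["customer_id"] in closing:
            row["end_date"] = closing[row["customer_id"]]
            row["is_current"] = False

    # Step 2: append a new current row for every insert or update.
    for c in changes:
        if c["_change_type"] in ("insert", "update_postimage"):
            result.append({
                "customer_id": c["customer_id"],
                "email": c["email"],
                "start_date": c["_commit_timestamp"],
                "end_date": None,
                "is_current": True,
            })
    return result


# Hypothetical sample data
dimension = [
    {"customer_id": 1, "email": "a@old.example", "start_date": "2024-01-01",
     "end_date": None, "is_current": True},
    {"customer_id": 2, "email": "b@example.com", "start_date": "2024-01-01",
     "end_date": None, "is_current": True},
]
changes = [
    {"customer_id": 1, "email": "a@new.example",
     "_change_type": "update_postimage", "_commit_timestamp": "2024-06-01"},
    {"customer_id": 2, "email": "b@example.com",
     "_change_type": "delete", "_commit_timestamp": "2024-06-02"},
    {"customer_id": 3, "email": "c@example.com",
     "_change_type": "insert", "_commit_timestamp": "2024-06-03"},
]
updated = apply_change_feed(dimension, changes)
```

After the call, customer 1 has a closed old row plus a new current row, customer 2 only has a closed row, and customer 3 has a single current row.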

David_Torrejon · ‎11-15-2024

Here is an example for a customer table.

The source_table contains new or updated customer data, and the target_table is the Delta table that maintains historical records.

Table Structures

source_table: contains the latest customer data.
  customer_id: Unique identifier for the customer.
  name: Customer's name.
  address: Customer's address.
  email: Customer's email.
  phone: Customer's phone number.

target_table: contains the historical customer data.
  customer_id: Unique identifier for the customer.
  name: Customer's name.
  address: Customer's address.
  email: Customer's email.
  phone: Customer's phone number.
  valid_from: Date when the record became effective.
  valid_to: Date until the record is effective.
  is_current: Flag indicating the current active record.
  hash_value: Hash of the attributes to detect changes.

WITH source_with_hash AS (
  SELECT
    customer_id,
    name,
    address,
    email,
    phone,
    md5(concat_ws('|', name, address, email, phone)) AS hash_value
  FROM source_table
)
MERGE INTO target_table AS target
USING source_with_hash AS source
ON target.customer_id = source.customer_id
  AND target.is_current = true
WHEN MATCHED AND target.hash_value != source.hash_value THEN
  UPDATE SET
    target.valid_to = current_date - 1,
    target.is_current = false
WHEN NOT MATCHED BY TARGET THEN
  INSERT (customer_id, name, address, email, phone, valid_from, valid_to, is_current, hash_value)
  VALUES (source.customer_id, source.name, source.address, source.email, source.phone, current_date, '9999-12-31', true, source.hash_value)
WHEN NOT MATCHED BY SOURCE AND target.is_current = true THEN
  UPDATE SET
    target.valid_to = current_date - 1,
    target.is_current = false;

Here is an explanation of each part of the statement.

WITH Clause:

Creates a subquery source_with_hash that adds a hash_value column to the source_table. This column contains an MD5 hash of the relevant attributes to detect changes.

MATCHED Clause:

Handles updates where there are changes in the source data (source.hash_value is different from target.hash_value).

Updates the valid_to date of the current record in the target table to the previous day and sets is_current to false.

NOT MATCHED BY TARGET Clause:

Inserts new records that do not exist in the target table.

Inserts the new records with valid_from set to the current date, valid_to set to '9999-12-31', and is_current set to true.

NOT MATCHED BY SOURCE Clause:

Handles records that are in the target table but not in the source table (optional, if you want to handle deletions).

Updates the valid_to date to the previous day and sets is_current to false.

You only have to adjust the column names and logic according to your specific schema and requirements.

I hope it helps you.
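
The hash-based change detection used in the MERGE above can be sketched in plain Python. This is only an illustration, not Spark code: row_hash is a hypothetical helper that imitates md5(concat_ws('|', ...)), including the fact that concat_ws skips NULL values, and the sample values are made up.

```python
import hashlib


def row_hash(*values, sep="|"):
    """MD5 hex digest of the non-null values joined by sep (NULLs are skipped)."""
    joined = sep.join(str(v) for v in values if v is not None)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Arguments are name, address, email, phone (hypothetical values)
stored = row_hash("Ann", "1 Main St", "ann@example.com", "555-0100")
unchanged = row_hash("Ann", "1 Main St", "ann@example.com", "555-0100")
moved = row_hash("Ann", "2 Oak Ave", "ann@example.com", "555-0100")

# Caveat: because NULLs are skipped, a value that moves between columns can go
# unnoticed. These two rows differ but produce the same hash.
null_address = row_hash("Ann", None, "x", "555-0100")
null_email = row_hash("Ann", "x", None, "555-0100")

# The digest is always 32 hex characters, however long the input is.
long_digest = row_hash("x" * 1_000_000)
```

Here stored equals unchanged and differs from moved, which is the comparison the WHEN MATCHED condition relies on. If NULLs can occur in the tracked columns, one option is to replace them with a sentinel (for example with coalesce) before hashing.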

dbarua · ‎11-26-2024

Is there any limitation to the length of the string passed to md5 function when concatenating multiple columns to generate hash_value field ?

David_Torrejon · ‎11-15-2024

Also, here is the same example in PySpark:

from delta.tables import DeltaTable
from pyspark.sql.functions import col, concat_ws, current_date, date_sub, lit, md5

source_df = spark.table("source_table")
# merge() is a DeltaTable method, so load the target as a DeltaTable, not a DataFrame
target_table = DeltaTable.forName(spark, "target_table")

source_with_hash_df = source_df.withColumn(
    "hash_value",
    md5(concat_ws("|", col("name"), col("address"), col("email"), col("phone")))
)

target_table.alias("target").merge(
    source_with_hash_df.alias("source"),
    "target.customer_id = source.customer_id AND target.is_current = true"
).whenMatchedUpdate(
    # close the current row when the tracked attributes changed
    condition="target.hash_value != source.hash_value",
    set={
        "valid_to": date_sub(current_date(), 1),
        "is_current": lit(False)
    }
).whenNotMatchedInsert(
    # insert customers that have no current row in the target
    values={
        "customer_id": col("source.customer_id"),
        "name": col("source.name"),
        "address": col("source.address"),
        "email": col("source.email"),
        "phone": col("source.phone"),
        "valid_from": current_date(),
        "valid_to": lit("9999-12-31").cast("date"),
        "is_current": lit(True),
        "hash_value": col("source.hash_value")
    }
).whenNotMatchedBySourceUpdate(
    # close current rows that no longer exist in the source
    condition="target.is_current = true",
    set={
        "valid_to": date_sub(current_date(), 1),
        "is_current": lit(False)
    }
).execute()

The merge only runs when you call .execute() at the end of the builder chain.

Prabhuram · ‎01-06-2025

Hi @David_Torrejon

Doesn't your example code perform SCD Type 1 rather than Type 2?

whenMatchedUpdate() updates an existing record.

whenNotMatchedInsert() inserts new records.

whenNotMatchedBySourceUpdate() updates records that are not available in the source.

In SCD Type 2, when an old record is updated, a corresponding new row needs to be inserted with is_current as 'true'. Where is this happening?

svrijssel · ‎01-07-2025

Yep, I was thinking the same. The only way I know is to run a separate INSERT INTO command before the MERGE INTO.

INSERT INTO target_table (
     columns,
     effectiveStartDate,
     effectiveEndDate,
     isCurrent,
     version
)
SELECT
     new.columns,
     DATE(new.timestamp),
     DATE('9999-12-31'),
     TRUE,
     target.version + 1
FROM df AS new
-- join against the current row of the same table that is being inserted into
LEFT JOIN target_table AS target
ON new.customerId = target.customerId AND target.isCurrent
WHERE (
  -- one comparison per tracked attribute; column_a and column_b are placeholders
  target.column_a <> new.column_a
  OR target.column_b <> new.column_b
);
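
To illustrate the point raised in this thread, that a changed record needs both a closed-out old row and a newly inserted current row, here is a minimal pure-Python sketch on in-memory dicts. It is not Spark code: apply_snapshot and HIGH_DATE are hypothetical names, a single attribute (email) stands in for all tracked columns, and customers missing from the source are left untouched.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # open-ended valid_to marker


def apply_snapshot(target, source, today):
    """Return a new SCD Type 2 history after applying a source snapshot.

    target: list of dicts (customer_id, email, valid_from, valid_to, is_current, version)
    source: dict mapping customer_id -> email
    """
    history = [dict(row) for row in target]  # copy so the input is not mutated
    current = {row["customer_id"]: row for row in history if row["is_current"]}
    new_rows = []
    for customer_id, email in source.items():
        old = current.get(customer_id)
        if old is not None and old["email"] == email:
            continue  # unchanged: keep the current row as it is
        if old is not None:
            # Changed: close the old version the day before the new one starts.
            old["valid_to"] = today - timedelta(days=1)
            old["is_current"] = False
        # New or changed: insert a fresh current version.
        new_rows.append({
            "customer_id": customer_id,
            "email": email,
            "valid_from": today,
            "valid_to": HIGH_DATE,
            "is_current": True,
            "version": old["version"] + 1 if old is not None else 1,
        })
    return history + new_rows


# Hypothetical sample data
target = [
    {"customer_id": 1, "email": "a@old.example", "valid_from": date(2024, 1, 1),
     "valid_to": HIGH_DATE, "is_current": True, "version": 1},
    {"customer_id": 2, "email": "b@example.com", "valid_from": date(2024, 1, 1),
     "valid_to": HIGH_DATE, "is_current": True, "version": 1},
]
source = {1: "a@new.example", 2: "b@example.com", 3: "c@example.com"}
result = apply_snapshot(target, source, today=date(2025, 1, 7))
```

Customer 1 ends up with two rows (version 1 closed, version 2 current), customer 2 is unchanged, and customer 3 gets a first version. In SQL, the same effect needs either an INSERT plus a MERGE, or a single MERGE whose source contains each changed row twice.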

jbhavesh · ‎10-27-2025

Same concern: where is this happening? Do we have any other example that handles it correctly by maintaining history?

JissMathew · ‎11-25-2024

Hi @Akshay_Petkar, please refer to this code:

bronze_df = spark.read.format("delta").load(f"{bronze_folder_path}/Test_new")

Table Structure

%sql

CREATE TABLE IF NOT EXISTS test_project_ws.demo.Test_merge (
  ID INT,
  Name STRING,
  Address STRING,
  date DATE,
  createdDate TIMESTAMP,
  updatedDate TIMESTAMP
)
USING DELTA

from pyspark.sql.functions import current_timestamp
from delta.tables import DeltaTable

table_name = "test_project_ws.demo.Test_merge"

deltaTable = DeltaTable.forName(spark, table_name)

deltaTable.alias("tgt").merge(
    bronze_df.alias("upd"),
    "tgt.Id = upd.Id"
).whenMatchedUpdate(
    set={
        "Id": "upd.Id",
        "Name": "upd.Name",
        "Address": "upd.Address",
        "date": "upd.date",
        "updatedDate": "current_timestamp()"
    }
).whenNotMatchedInsert(
    values={
        "Id": "upd.Id",
        "Name": "upd.Name",
        "Address": "upd.Address",
        "date": "upd.date",
        "createdDate": "current_timestamp()"
    }
).execute()

Jiss Mathew
India .

bhanu_gautam · ‎11-28-2024

@JissMathew and @David_Torrejon , Thanks for sharing the example

Regards
Bhanu Gautam

Kudos are appreciated

Need a Sample MERGE INTO Query for SCD Type 2 Implementation