merge breaking persistence of dataframe

FabriceDeseyn
Contributor

Hi all

In the minimal example below, you can see that executing a merge statement triggers recomputation of a persisted DataFrame. How does this happen?

from delta.tables import DeltaTable

table_name = "hive_metastore.default.test_table"

# initialize table
data = [{"column1": 1}]
df = spark.createDataFrame(data)
df.write.saveAsTable(table_name, mode="overwrite")
dt_dummy = DeltaTable.forName(spark, table_name)

# Create new data
new_data = [{"column1": 2}]
df_new = spark.createDataFrame(new_data)

# Identify differences
df_diff = dt_dummy.toDF().join(df_new, "column1", "outer")
df_diff = df_diff.persist()
print(f"we now have {df_diff.count()} records in our data")

# Execute merge
print("executing merge")
merge_results = (
    dt_dummy.alias("deltatable")
    .merge(df_diff.alias("updates"), "updates.column1 != deltatable.column1")
    .whenMatchedUpdateAll()
).execute()
# Identify differences after merge
print(f"we now have {df_diff.count()} records in our persisted data")

# --> recompute of persisted dataframe happened

(screenshot attachment: FabriceDeseyn_1-1694011507567.png)

This was run on DBR 13.2 and 13.3 LTS.

1 REPLY

Kaniz
Community Manager

Hi @FabriceDeseyn, the recomputation of the persisted DataFrame occurs because of how the merge operation works in Delta Lake. When a merge is executed, it triggers a re-evaluation of the DataFrame, which includes re-computing any persisted DataFrames that are part of the operation. A merge needs to ensure it is working with the most recent data, so it cannot rely on a persisted DataFrame that might be stale.

The merge operation in Delta Lake also follows certain rules for schema validation: it ensures that the schema of the data generated by insert and update expressions is compatible with the schema of the table. For update and insert actions, the specified target columns must exist in the target Delta table. For updateAll and insertAll actions, the source dataset must have all the columns of the target Delta table; the source can have extra columns, and they are ignored. For all actions, if the data types produced by the expressions for the target columns differ from the corresponding columns in the target Delta table, merge tries to cast them to the types in the table.
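
If you need the diff to stay stable across the merge, one possible workaround (a minimal sketch based on your code above; localCheckpoint() is a standard Spark DataFrame API, but I have not verified this behaviour on DBR 13.2/13.3) is to cut the lineage back to the Delta table before merging, so the materialized rows no longer depend on the table that the merge rewrites:

# Sketch of a possible workaround: truncate lineage before the merge so the
# materialized diff no longer references the Delta table that the merge rewrites.
df_diff = dt_dummy.toDF().join(df_new, "column1", "outer")
df_diff = df_diff.localCheckpoint()  # eagerly materializes the rows and cuts lineage
print(f"we now have {df_diff.count()} records in our data")

# ...run the merge as before; counting df_diff afterwards should no longer
# re-read the table, because its plan now ends at the checkpointed data.

Note that localCheckpoint() writes to executor-local storage rather than a reliable store, so this trades fault tolerance for a truncated lineage.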
