cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Feature request: Allow to set value as null when not present in schema evolution

Dhruv-22
Contributor II

I want to raise a feature request as follows.

Currently, in the Automatic schema evolution for merge when a column is not present in the source dataset it is not changed in the target dataset. For e.g.

%sql
CREATE OR REPLACE TABLE edw_nprd_aen.bronze.test_table (
  id INT
  , assignments ARRAY<STRING>
);

INSERT INTO edw_nprd_aen.bronze.test_table
VALUES 
  ( 1, ARRAY('S1-1', 'S1-2') )
  , (2, NULL)
  , (3, ARRAY('S3-1'))
  , (4, ARRAY('S4-1'));

SELECT *
FROM edw_nprd_aen.bronze.test_table;

Dhruv22_0-1767970990008.png

%sql
CREATE OR REPLACE TEMPORARY VIEW test_view (id) AS VALUES (1), (3);

SELECT *
FROM test_view

Dhruv22_1-1767971051176.png

%sql
MERGE WITH SCHEMA EVOLUTION INTO edw_nprd_aen.bronze.test_table
USING test_view
ON test_table.id = test_view.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

SELECT *
FROM edw_nprd_aen.bronze.test_table
ORDER BY id

Dhruv22_2-1767971116934.png

---

What I want is that if the row matches and the column is not present then set it to null. Like the following

Dhruv22_3-1767971213212.png

Currently, I'm achieving this as follows

table_cols = spark.table('edw_nprd_aen.bronze.test_table').columns
view_cols = spark.table('test_view').columns

merge_string = f"""
MERGE WITH SCHEMA EVOLUTION INTO edw_nprd_aen.bronze.test_table
USING test_view
ON test_table.id = test_view.id
WHEN MATCHED THEN UPDATE SET
{'\n, '.join([f"{col} = test_view.{col}" if col in view_cols else f"{col} = NULL" for col in table_cols])}
WHEN NOT MATCHED THEN INSERT *
"""

print(merge_string)
spark.sql(merge_string)

Dhruv22_4-1767971292625.png

 

It will be great if there could be some in-built option for the same, like

UPDATE SET * NULL IF NOT PRESENT
0 REPLIES 0