cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Are MERGE INTO inserts supported when the delta table has an identity column ?

VaDim
New Contributor III

I can't seem to make it work as I keep getting:

DeltaInvariantViolationException: NOT NULL constraint violated for column: dl_id.

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

I was just typing this workaround ๐Ÿ™‚ Because the id col is defined as GENERATED ALWAYS AS IDENTITY it can never be updated. If you then use UpdateAll it will generate an error.

remove the Id col from the columns to be updated.

This is the way.

View solution in original post

8 REPLIES 8

-werners-
Esteemed Contributor III

I would think it is supported as that is the whole purpose of the ID column, not providing the value yourself.

I haven't tested it though.

But what version of databricks do you use? You need a pretty recent version (10.4+).

Also: how is the identity column created?

VaDim
New Contributor III

I use 11.2 runtime. identity column created as:

dl_id   BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY

and the pyspark code is an upsert:

t.alias("ec").merge(df.alias("uc"), "ec.dl_id = uc.dl_id") \
            .whenMatchedUpdateAll() \
            .whenNotMatchedInsertAll() \
            .execute()

-werners-
Esteemed Contributor III

Hm, and both ec and uc have id's always filled in?

VaDim
New Contributor III

`ec` - always. `uc` - not. because `uc` might contain new data that needs appended.

-werners-
Esteemed Contributor III

I think that is the issue. uc contains a column which is supposed to be an id, but you cannot pass it to the merge as it is generated.

What happens if you merge on the natural key?

VaDim
New Contributor III

I haven't tried but I suspect it will fail with the same message on INSERT because uc.dl_id is NULL for some rows and `whenNotMatchedInsertAll` will attempt to insert a value for dl_id field instead of generating one (as if it has been user provided).

In the meantime I found a workaround: explicitly set the column mapping and do not include one for `dl_id`.

cols2update is a dict with column mapping except dl_id, it works.

t.alias("ec").merge(df.alias("uc"), "ec.dl_id = uc.dl_id") \
            .whenMatchedUpdate(set=cols2update) \
            .whenNotMatchedInsert(values=cols2update) \
            .execute()

-werners-
Esteemed Contributor III

I was just typing this workaround ๐Ÿ™‚ Because the id col is defined as GENERATED ALWAYS AS IDENTITY it can never be updated. If you then use UpdateAll it will generate an error.

remove the Id col from the columns to be updated.

This is the way.

byrdman
New Contributor III

if you are using 'delta.columnMapping.mode' = 'name' on your table i could not get it to work, without that line .. for the not matched .. WHEN NOT MATCHED 

 THEN INSERT (columnname,columnName2) values(columnname,columnName2)

WHEN MATCHED 

Then UPDATE SET (target.columnname = source.columnname,target.columnname2 = source.columnname2) That worked for me.. Leave the id column out of any of these

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group