Data Engineering

Returning and reusing the identity value

dpc
New Contributor III
Hello
 
I have a table with a column defined as an identity (BIGINT GENERATED ALWAYS AS IDENTITY).
I will be inserting rows into this table in parallel.
How can I get the generated identity value and use it within a pipeline?
Parallel is relevant because there will be multiple inserts feeding multiple next steps at the same time.
e.g. workflow (multiple streams in parallel): Task 1 - insert a row; Task 2 - insert rows into another table including the identity value from Task 1; Task 3 - insert rows into another table including the identity value from Task 1; etc.
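For reference, the table is created roughly like this (table and column names here are just placeholders, only the identity column matters):

CREATE TABLE batch_log (
  batch_id   BIGINT GENERATED ALWAYS AS IDENTITY,
  created_at TIMESTAMP
);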
 
In SQL Server, I would just insert a row and return @@identity, then pass that value around using stored procedure(s).
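This is the SQL Server pattern I mean (illustrative names only):

INSERT INTO dbo.batch_log (created_at) VALUES (SYSDATETIME());
SELECT @@IDENTITY;  -- or SCOPE_IDENTITY(), which is limited to the current scope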
 
Thanks

szymon_dybczak
Contributor III

Hi @dpc ,

What you're trying to achieve doesn't make sense in the context of identity columns. Take a look at the entry below from the documentation. So, the answer is: if you want concurrent transactions, don't use identity columns 🙂

Declaring an identity column on a Delta table disables concurrent transactions. Only use identity columns in use cases where concurrent writes to the target table are not required.

dpc
New Contributor III

Thanks Slash

In this case though, the batch generation is not concurrent - it's sequential - but the full batch run can be concurrent (if that makes sense).

So, I could be running 5 batches in parallel (not necessarily starting at the same time) and all 5 generate a different id.

The batches can differ in what they do, but the key here is that, where required, each one records the batch id relevant to its own batch - so the id is recorded consistently throughout, in any table writes where it's needed.

There are some suggestions elsewhere that you can generate one and then just read back the last batch id.

That wouldn't work here
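
Just to make the shape of it concrete, this is roughly the pattern I'm after (hypothetical names; it assumes an extra batch_key column that each run already knows, e.g. a workflow run id passed in as a parameter), so each run looks up its own row rather than "the last one":

-- Task 1: create the batch row for this run
INSERT INTO batch_log (batch_key, created_at)
VALUES (:batch_key, current_timestamp());

-- Task 1: read back this run's generated id by its own key, not the last id
SELECT batch_id FROM batch_log WHERE batch_key = :batch_key;

-- Tasks 2..n: stamp that id on their writes to other tables
INSERT INTO other_table (batch_id, some_col)
SELECT :batch_id, some_col FROM staging_source;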
