Data Engineering

Best practices to load single delta table in parallel from multiple processes.

Anonymous47
New Contributor II

Hi all,

A Delta Lake table was created with an identity column, and it is not possible to load data into this table in parallel from multiple processes: concurrent writes fail with a MetadataChangedException.

Based on another post in the community, we could catch the exception and retry the write a number of times. But for high-volume tables where each write takes a long time, couldn't the load still fail even after all retries?

I would like to understand:

1) What best practices can we implement for this use case, or for parallel writes in general? When should row-level concurrency be used?

2) Is there a way to generate sequential numbers without using an identity column? UUIDs and monotonically_increasing_id() do not produce a sequential series.

3) Is any enhancement planned to introduce an equivalent of Oracle's SEQUENCE? The current sequence function in Databricks behaves differently.

1 REPLY

Kaniz
Community Manager

Hi @Anonymous47 , Let’s dive into your questions regarding Delta Lake and parallel writes:

 

Best Practices for Parallel Writes:

  • Partitioning: Choose an appropriate partition column for your Delta table. Typically, the most commonly used partition column is the date. Remember these rules of thumb:
    • Avoid using columns with very high cardinality for partitioning (e.g., a column with 1M distinct user IDs).
    • Partition by a column if you expect data in that partition to be at least 1 GB.
  • Compaction: Regularly compact your Delta table to consolidate small files into larger ones; this improves read efficiency and file-system performance. You can use the OPTIMIZE command, or rewrite files via repartition with the dataChange option set to false, so that compaction does not register as a data change for concurrent readers and streaming consumers.
  • Concurrency Control: Delta Lake provides ACID transaction guarantees. Multiple writers across multiple clusters can simultaneously modify a table while readers continue to see a consistent snapshot, even when the table is modified during a job. Note, however, that declaring an identity column effectively serializes inserts: every write must advance the identity high-water mark stored in table metadata, which is why your parallel loads hit MetadataChangedException. Row-level concurrency (available on newer runtimes when deletion vectors are enabled) detects conflicts at the row level rather than the file level, which reduces spurious conflicts between concurrent MERGE, UPDATE, and DELETE operations.
  • VACUUM: Use Databricks Runtime 10.4 LTS or above and additional driver cores (Azure and GCP only) fo....
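Since even well-partitioned parallel loads can hit transient conflicts, the retry approach mentioned in the question is usually implemented as a loop with exponential backoff and jitter. A minimal plain-Python sketch; the exception class and the write function here are illustrative placeholders, not Delta Lake APIs:

```python
import random
import time


class MetadataChangedException(Exception):
    """Stand-in for Delta's concurrent-write conflict (illustrative only)."""


def write_with_retry(write_fn, max_attempts=5, base_delay=1.0):
    """Call write_fn(); on a metadata conflict, back off exponentially and retry.

    write_fn is assumed to re-read the latest table snapshot on each attempt,
    so a retry resolves the conflict rather than repeating it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except MetadataChangedException:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter de-synchronizes competing writers.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)


# Example: a writer that conflicts twice, then succeeds on the third attempt.
attempts = {"n": 0}

def flaky_write():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise MetadataChangedException("concurrent metadata update")
    return "committed"

print(write_with_retry(flaky_write, base_delay=0.01))  # committed
```

The jitter matters: if several loaders conflict at the same moment and all sleep the same fixed interval, they will collide again on the next attempt.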

Generating Sequential Numbers Without Identity Columns:

  • If you want to generate sequential numbers without using an identity column, consider the sequence function in Databricks SQL. It generates an array of elements from a start value to a stop value (inclusive), incrementing by a specified step; for example, sequence(1, 5) returns array(1, 2, 3, 4, 5). Note that sequence builds an array inside a single row, so to number the rows of a table you would typically combine an offset with a row_number() window function instead.
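To assign a per-row sequential id without an identity column, a common pattern is to read the target table's current MAX(id) as an offset and add a row number for each row of the incoming batch (in Spark, row_number() over a Window). A plain-Python sketch of that offset-plus-row-number idea, with illustrative names:

```python
def assign_sequential_ids(current_max_id, new_rows):
    """Mimic the MAX(id)-offset + row_number() pattern used in Spark.

    current_max_id: highest id already in the target table (0 if empty).
    new_rows: the incoming batch, in the order the rows should be numbered.
    Returns (id, row) pairs forming a dense, sequential id series.
    """
    return [(current_max_id + n, row) for n, row in enumerate(new_rows, start=1)]


batch = ["alice", "bob", "carol"]
print(assign_sequential_ids(100, batch))  # [(101, 'alice'), (102, 'bob'), (103, 'carol')]
```

Like an identity column, this pattern assumes one writer computes the offset at a time: two concurrent batches would read the same MAX(id) and collide, so parallel loaders still need conflict handling such as the retry approach raised in the question.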

Sequence Equivalent on Oracle:

  • To be precise, there is no direct equivalent of Oracle's SEQUENCE object in Databricks. You can achieve similar functionality with other techniques, such as maintaining a custom sequence table or using row numbers, but keep in mind that the behavior differs both from Oracle sequences and from the array-producing sequence function in Databricks SQL.
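The custom sequence table mentioned above can be pictured as a single-row table holding the last issued value, advanced with a transactional conditional update so two callers cannot draw the same number. A plain-Python sketch of that idea; the class and its compare-and-set are illustrative stand-ins for a Delta table updated inside a transaction-plus-retry loop:

```python
class SequenceTable:
    """Single-row 'sequence' table: NEXTVAL reads the last value, then
    advances it with a compare-and-set (stand-in for a conditional UPDATE)."""

    def __init__(self, start=0, increment=1):
        self.last_value = start
        self.increment = increment

    def _compare_and_set(self, expected, new):
        # Stands in for a transactional conditional UPDATE on the sequence row.
        if self.last_value != expected:
            return False
        self.last_value = new
        return True

    def nextval(self):
        while True:
            current = self.last_value
            candidate = current + self.increment
            # If another caller advanced the value first, re-read and retry.
            if self._compare_and_set(current, candidate):
                return candidate


seq = SequenceTable(start=100, increment=1)
print(seq.nextval(), seq.nextval())  # 101 102
```

The trade-off is the same one Oracle sequences were designed to avoid: every number drawn is a round trip through shared state, so this serializes id allocation and can become a bottleneck under heavy parallel load.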

Remember to adapt these practices based on your specific use case and requirements. If you have further questions or need additional guidance, feel free to ask! 😊
