Data Engineering

Best practices to load single delta table in parallel from multiple processes.

Anonymous47
New Contributor II

Hi all,

A Delta Lake table was created with an identity column, and it is not possible to load data into this table in parallel from multiple processes: concurrent writes fail with a MetadataChangedException.

Based on another post in the community, we could catch the exception and retry the write a number of times. But for high-volume tables where each write takes a long time, couldn't the load still fail even after all retries?

I would like to understand:

1) What best practices can we implement for this use case, or for parallel writes in general? When should row-level concurrency be used?

2) Is there a way to generate sequential numbers without using an identity column? UUIDs and monotonically_increasing_id() do not produce a sequential series.

3) Is any enhancement planned to introduce an equivalent of Oracle's SEQUENCE? The current sequence function in Databricks behaves differently.

1 REPLY

Kaniz
Community Manager

Hi @Anonymous47 , Let’s dive into your questions regarding Delta Lake and parallel writes:

 

Best Practices for Parallel Writes:

  • Partitioning: Choose an appropriate partition column for your Delta table. Typically, the most commonly used partition column is the date. Remember these rules of thumb:
    • Avoid using columns with very high cardinality for partitioning (e.g., a column with 1M distinct user IDs).
    • Partition by a column if you expect data in that partition to be at least 1 GB.
  • Compaction: Regularly compact your Delta table to consolidate small files into larger ones; this improves read efficiency and file-system performance. You can use the OPTIMIZE command, or rewrite files via repartition with the dataChange option set to false, so that compaction does not register as a data change for concurrent readers and streaming consumers.
  • Concurrency Control: Delta Lake provides ACID transaction guarantees. Multiple writers across multiple clusters can simultaneously modify a table while readers continue to see a consistent snapshot, even when the table is modified during a job. Note, however, that declaring an identity column effectively serializes inserts: every write must advance the identity high-water mark stored in table metadata, which is why your parallel loads hit MetadataChangedException. Row-level concurrency (available on newer runtimes when deletion vectors are enabled) detects conflicts at the row level rather than the file level, which reduces spurious conflicts between concurrent MERGE, UPDATE, and DELETE operations.
  • VACUUM: Use Databricks Runtime 10.4 LTS or above and additional driver cores (Azure and GCP only) fo....
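Since even well-partitioned parallel loads can hit transient conflicts, the retry approach mentioned in the question is usually implemented as a loop with exponential backoff and jitter. A minimal plain-Python sketch; the exception class and the write function here are illustrative placeholders, not Delta Lake APIs:

```python
import random
import time


class MetadataChangedException(Exception):
    """Stand-in for Delta's concurrent-write conflict (illustrative only)."""


def write_with_retry(write_fn, max_attempts=5, base_delay=1.0):
    """Call write_fn(); on a metadata conflict, back off exponentially and retry.

    write_fn is assumed to re-read the latest table snapshot on each attempt,
    so a retry resolves the conflict rather than repeating it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except MetadataChangedException:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter de-synchronizes competing writers.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(delay)


# Example: a writer that conflicts twice, then succeeds on the third attempt.
attempts = {"n": 0}

def flaky_write():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise MetadataChangedException("concurrent metadata update")
    return "committed"

print(write_with_retry(flaky_write, base_delay=0.01))  # committed
```

The jitter matters: if several loaders conflict at the same moment and all sleep the same fixed interval, they will collide again on the next attempt.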

Generating Sequential Numbers Without Identity Columns:

  • If you want to generate sequential numbers without using an identity column, consider the sequence function in Databricks SQL. It generates an array of elements from a start value to a stop value (inclusive), incrementing by a specified step; for example, sequence(1, 5) returns array(1, 2, 3, 4, 5). Note that sequence builds an array inside a single row, so to number the rows of a table you would typically combine an offset with a row_number() window function instead.
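To assign a per-row sequential id without an identity column, a common pattern is to read the target table's current MAX(id) as an offset and add a row number for each row of the incoming batch (in Spark, row_number() over a Window). A plain-Python sketch of that offset-plus-row-number idea, with illustrative names:

```python
def assign_sequential_ids(current_max_id, new_rows):
    """Mimic the MAX(id)-offset + row_number() pattern used in Spark.

    current_max_id: highest id already in the target table (0 if empty).
    new_rows: the incoming batch, in the order the rows should be numbered.
    Returns (id, row) pairs forming a dense, sequential id series.
    """
    return [(current_max_id + n, row) for n, row in enumerate(new_rows, start=1)]


batch = ["alice", "bob", "carol"]
print(assign_sequential_ids(100, batch))  # [(101, 'alice'), (102, 'bob'), (103, 'carol')]
```

Like an identity column, this pattern assumes one writer computes the offset at a time: two concurrent batches would read the same MAX(id) and collide, so parallel loaders still need conflict handling such as the retry approach raised in the question.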

Sequence Equivalent on Oracle:

  • To be precise, there is no direct equivalent of Oracle's SEQUENCE object in Databricks. You can achieve similar functionality with other techniques, such as maintaining a custom sequence table or using row numbers, but keep in mind that the behavior differs both from Oracle sequences and from the array-producing sequence function in Databricks SQL.
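The custom sequence table mentioned above can be pictured as a single-row table holding the last issued value, advanced with a transactional conditional update so two callers cannot draw the same number. A plain-Python sketch of that idea; the class and its compare-and-set are illustrative stand-ins for a Delta table updated inside a transaction-plus-retry loop:

```python
class SequenceTable:
    """Single-row 'sequence' table: NEXTVAL reads the last value, then
    advances it with a compare-and-set (stand-in for a conditional UPDATE)."""

    def __init__(self, start=0, increment=1):
        self.last_value = start
        self.increment = increment

    def _compare_and_set(self, expected, new):
        # Stands in for a transactional conditional UPDATE on the sequence row.
        if self.last_value != expected:
            return False
        self.last_value = new
        return True

    def nextval(self):
        while True:
            current = self.last_value
            candidate = current + self.increment
            # If another caller advanced the value first, re-read and retry.
            if self._compare_and_set(current, candidate):
                return candidate


seq = SequenceTable(start=100, increment=1)
print(seq.nextval(), seq.nextval())  # 101 102
```

The trade-off is the same one Oracle sequences were designed to avoid: every number drawn is a round trip through shared state, so this serializes id allocation and can become a bottleneck under heavy parallel load.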

Remember to adapt these practices based on your specific use case and requirements. If you have further questions or need additional guidance, feel free to ask! 😊
